Recent breakthroughs in artificial intelligence (AI) have led to the development of large language models (LLMs), sophisticated natural language processing systems trained on vast amounts of data [8]. Chatbots built on LLMs are a generative AI application, exemplified by ChatGPT, capable of producing conversational responses to user inputs [9]. Within two months of its release in November 2022, ChatGPT surpassed 100 million users [10]. The ability to interact directly with models such as ChatGPT makes LLMs attractive tools in many fields, including medicine [11]. LLMs could potentially assist in various areas of medicine, given their capacity to process complex concepts and respond to diverse requests and questions (prompts) [12,13]. Clinical decision support systems were proposed early as a potential application of this technology. However, previous research has argued that LLMs remain immature, prone to errors, and unstable in their outputs, which significantly limits their use in clinical workflows [14,15]. Nonetheless, several studies have demonstrated strong performance by AI chatbots in interactive dialogues, including a notable capability for clinical decision support based on input content [16,17]. Some research has further shown that chatbots can accurately answer clinical queries posed on social media platforms, often with greater empathy than physician responses [18,19]. Nevertheless, the potential of chatbots in clinical settings remains largely untapped. In this study, we explore the ability of chatbots to interpret MRI reports for patients without a medical background and identify the key elements relevant to disease management.
We collected MRI reports from 6,174 tumor patients who underwent scans between January 1, 2019, and December 31, 2024, at three locations. These reports, varying in length and complexity, were authored by radiologists specializing in multiple anatomical systems, including but not limited to the nervous, digestive, and urinary systems. Each report meticulously detailed the normal anatomical structures of the scanned regions, the abnormal lesion signals observed, and provided a concise preliminary diagnosis. Two independent reviewers analyzed the original MRI reports alongside corresponding scans to classify findings into benign, atypical, or malignant categories (Table 1). In cases of disagreement, a third oncologist made the final determination. To preserve the authenticity of the data, no alterations were made to the content of the original reports. Additionally, all identifiable information, such as patient details, examination dates, registration numbers, and physician names, was anonymized to protect patient confidentiality.
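The anonymization step described above can be sketched as a simple pattern-replacement pass. This is an illustrative sketch only: the field labels and patterns below are assumptions, not the study's actual de-identification procedure, which is not specified in detail.

```python
import re

# Hypothetical redaction patterns for the identifiers named in the text:
# patient details, examination dates, registration numbers, physician names.
PATTERNS = [
    (re.compile(r"Patient Name:\s*\S.*"), "Patient Name: [REDACTED]"),
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "[DATE]"),  # examination dates
    (re.compile(r"Reg(?:istration)? No\.?\s*\d+"), "Reg No. [REDACTED]"),
    (re.compile(r"Radiologist:\s*\S.*"), "Radiologist: [REDACTED]"),
]

def deidentify(report: str) -> str:
    """Remove identifiable header fields from a report, line by line."""
    for pattern, replacement in PATTERNS:
        report = pattern.sub(replacement, report)
    return report
```

In practice, de-identification of free-text clinical reports usually also requires manual verification, since regular expressions alone can miss identifiers embedded in the narrative.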
This study utilized two chatbots: GPT o1-preview (developed by OpenAI), referred to hereafter as Chatbot 1, and DeepSeek-R1 (developed by DeepSeek), designated as Chatbot 2. The study period spanned from February 1 to March 31, 2025, during which the models were queried with prompts derived from the original reports. To standardize readability comparisons, all original MRI reports and submitted queries were exclusively in English. The chatbots were tasked with answering four questions in sequence: first, to interpret the report in a manner understandable to patients without a medical background; second, to classify the lesions as benign, atypical, or malignant; third, to assess the necessity of surgical intervention; and fourth, to recommend a treatment plan based on the report's content (Table 2). To minimize carry-over bias between reports, a new chat session was initiated for each report analyzed. The chatbots' responses to each prompt were documented.
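The four-question sequence per report can be sketched as a small prompt builder. The wording below is illustrative only; the exact phrasing used in the study is the one given in Table 2, and the function name is a placeholder.

```python
def build_prompts(report_text: str) -> list[str]:
    """Return the four sequential questions for one MRI report.

    The report is attached to the first prompt; the remaining questions
    continue within the same chat session for that report only.
    """
    questions = [
        "Please explain this MRI report in language a patient without "
        "a medical background can understand.",
        "Based on the report, classify the lesion as benign, atypical, "
        "or malignant.",
        "Is surgical intervention necessary for this patient?",
        "Recommend a treatment plan based on the report's content.",
    ]
    first = f"MRI report:\n{report_text}\n\n{questions[0]}"
    return [first] + questions[1:]
```

Starting a fresh session per report, as the study did, prevents the model's answers for one patient from leaking context into the next.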
In this study, readability of both the original MRI reports and the explanatory reports generated by the chatbots was assessed using an online tool (https://www.webfx.com/tools/read-able/). Three widely recognized readability indices were calculated: the Flesch Reading Ease (FRE) score, the Flesch-Kincaid Grade Level (FKGL), and the Gunning Fog Score (GFS). All readability assessments were completed within the study period (February 1 to March 31, 2025). Subsequently, the responses provided by the chatbots underwent medical review: each response thread was independently evaluated by two medical reviewers, and in cases of disagreement, a third oncologist was consulted to adjudicate the discrepancy.
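The three indices are closed-form functions of sentence length, word length, and syllable counts. The sketch below implements the standard published formulas; the syllable counter is a crude vowel-group heuristic (real tools use pronunciation dictionaries), so exact scores will differ slightly from the online tool's.

```python
import re

def _syllables(word: str) -> int:
    # Crude heuristic: count vowel groups, discount a trailing silent "e".
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> dict:
    """Compute FRE, FKGL, and GFS from their standard formulas."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(len(words), 1)
    n_syll = sum(_syllables(w) for w in words)
    complex_words = sum(1 for w in words if _syllables(w) >= 3)
    wps = n_words / sentences   # average words per sentence
    spw = n_syll / n_words      # average syllables per word
    return {
        "FRE":  206.835 - 1.015 * wps - 84.6 * spw,   # higher = easier
        "FKGL": 0.39 * wps + 11.8 * spw - 15.59,      # US grade level
        "GFS":  0.4 * (wps + 100 * complex_words / n_words),
    }
```

Note the opposing directions: a more readable text raises FRE but lowers FKGL and GFS, which matters when interpreting comparisons between the original reports and the chatbot explanations.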
The medical review of the explanatory reports, i.e., responses to the first question, categorized findings into four distinct levels: correct, partially correct, partially incorrect, and incorrect. "Correct" indicates that all content from the original report is accurately included without errors. "Partially correct" refers to omissions of details that do not affect patient management, such as failure to describe normal variations in sulci and gyri. "Partially incorrect" includes errors that slightly affect patient management, such as minor inaccuracies in describing tumor size or shape that are not significant enough to alter the diagnosis or treatment recommendations. "Incorrect" signifies errors that significantly impact patient management, such as misdescribing the tumor's location.
The second and third questions aim to classify the nature of the tumor and determine the necessity of surgical intervention, respectively. For these classification tasks, the outcomes were assessed as either correct or incorrect. Additionally, reviewers used a 5-point Likert scale to rate the quality of the treatment suggestions and the empathy demonstrated by the chatbots in their responses, from 1 (very poor) to 5 (excellent).
The Academic Ethics Review Committee of Peking Union Medical College Hospital, Chinese Academy of Medical Sciences, provided an exemption from ethical review for this cross-sectional study (Exemption No. SZ-3192). All data utilized in this study were de-identified to ensure the privacy and confidentiality of human subjects. The Institutional Review Board of Peking Union Medical College Hospital, Chinese Academy of Medical Sciences waived the requirement for the original informed consent and permitted secondary analysis without additional consent. The study adhered to the principles of the Declaration of Helsinki and followed the guidelines of the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE).
The Friedman test was employed to compare readability between the original reports and the two chatbot-generated explanatory reports. The Wilcoxon signed-rank test was used to compare readability, quality of therapeutic recommendations, and empathy between the two chatbot-generated reports. Additionally, the Chi-square test was applied to evaluate differences in medical review performance between the two chatbots. All statistical analyses were conducted using SPSS software (version 26.0, IBM), and a two-tailed p-value of less than 0.05 was considered statistically significant.
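The same three tests are available outside SPSS, e.g. in SciPy. The sketch below shows the analysis structure on fabricated placeholder numbers (the scores and counts are random or made up purely to make the sketch runnable; they are not study data).

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon, chi2_contingency

rng = np.random.default_rng(0)

# Hypothetical FRE scores for the same 30 reports under three paired
# conditions: original report, Chatbot 1 explanation, Chatbot 2 explanation.
original = rng.normal(35, 5, 30)
bot1 = original + rng.normal(25, 5, 30)  # explanations read easier
bot2 = original + rng.normal(24, 5, 30)

# Friedman test: readability across the three paired conditions.
stat_f, p_f = friedmanchisquare(original, bot1, bot2)

# Wilcoxon signed-rank test: paired comparison of the two chatbots.
stat_w, p_w = wilcoxon(bot1, bot2)

# Chi-square test on a 2x2 contingency table of medical review outcomes
# (made-up counts: rows = chatbots, columns = correct / incorrect).
table = np.array([[52, 8],
                  [47, 13]])
chi2, p_c, dof, expected = chi2_contingency(table)
```

Friedman and Wilcoxon are the appropriate nonparametric choices here because the three readability measurements are repeated on the same reports (paired, not independent samples).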