Curated by THEOUTPOST
On Thu, 2 Jan, 4:01 PM UTC
3 Sources
[1]
How good are AI doctors at medical conversations?
Artificial intelligence tools such as ChatGPT have been touted for their promise to alleviate clinician workload by triaging patients, taking medical histories and even providing preliminary diagnoses. These tools, known as large language models, are already being used by patients to make sense of their symptoms and medical test results.

But while these AI models perform impressively on standardized medical tests, how well do they fare in situations that more closely mimic the real world? Not that great, according to the findings of a new study led by researchers at Harvard Medical School and Stanford University.

For their analysis, published Jan. 2 in Nature Medicine, the researchers designed an evaluation framework -- or a test -- called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) and deployed it on four large language models to see how well they performed in settings closely mimicking actual interactions with patients.

All four large language models did well on medical exam-style questions, but their performance worsened when engaged in conversations more closely mimicking real-world interactions. This gap, the researchers said, underscores a two-fold need: first, to create more realistic evaluations that better gauge the fitness of clinical AI models for use in the real world and, second, to improve the ability of these tools to make diagnoses based on more realistic interactions before they are deployed in the clinic. Evaluation tools like CRAFT-MD, the research team said, can not only assess AI models more accurately for real-world fitness but could also help optimize their performance in the clinic.

"Our work reveals a striking paradox -- while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor's visit," said study senior author Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School. "The dynamic nature of medical conversations -- the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms -- poses unique challenges that go far beyond answering multiple-choice questions. When we switch from standardized tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy."

A better test to check AI's real-world performance

Right now, developers test the performance of AI models by asking them to answer multiple-choice medical questions, typically derived from the national exam for graduating medical students or from tests given to medical residents as part of their certification.

"This approach assumes that all relevant information is presented clearly and concisely, often with medical terminology or buzzwords that simplify the diagnostic process, but in the real world this process is far messier," said study co-first author Shreya Johri, a doctoral student in the Rajpurkar Lab at Harvard Medical School. "We need a testing framework that reflects reality better and is, therefore, better at predicting how well a model would perform."

CRAFT-MD was designed to be one such more realistic gauge. To simulate real-world interactions, CRAFT-MD evaluates how well large language models can collect information about symptoms, medications, and family history and then make a diagnosis. An AI agent is used to pose as a patient, answering questions in a conversational, natural style.
Another AI agent grades the accuracy of the final diagnosis rendered by the large language model. Human experts then evaluate the outcomes of each encounter for the ability to gather relevant patient information, diagnostic accuracy when presented with scattered information, and adherence to prompts.

The researchers used CRAFT-MD to test four AI models -- both proprietary (commercial) and open-source ones -- for performance in 2,000 clinical vignettes featuring conditions common in primary care and across 12 medical specialties.

All AI models showed limitations, particularly in their ability to conduct clinical conversations and reason based on information given by patients. That, in turn, compromised their ability to take medical histories and render appropriate diagnoses. For example, the models often struggled to ask the right questions to gather pertinent patient history, missed critical information during history taking, and had difficulty synthesizing scattered information. The accuracy of these models declined when they were presented with open-ended information rather than multiple-choice answers. These models also performed worse when engaged in back-and-forth exchanges -- as most real-world conversations are -- than when engaged in summarized conversations.

Recommendations for optimizing AI's real-world performance

Based on these findings, the team offers a set of recommendations both for AI developers who design AI models and for regulators charged with evaluating and approving these tools. These include:

- Using conversational, open-ended questions that more accurately mirror unstructured doctor-patient interactions in the design, training, and testing of AI tools
- Assessing models for their ability to ask the right questions and to extract the most essential information
- Designing models capable of following multiple conversations and integrating information from them
- Designing AI models capable of integrating textual data (notes from conversations) with non-textual data (images, EKGs)
- Designing more sophisticated AI agents that can interpret non-verbal cues such as facial expressions, tone, and body language

Additionally, the evaluation should include both AI agents and human experts, the researchers recommend, because relying solely on human experts is labor-intensive and expensive. For example, CRAFT-MD outpaced human evaluators, processing 10,000 conversations in 48 to 72 hours, plus 15-16 hours of expert evaluation. In contrast, human-based approaches would require extensive recruitment and an estimated 500 hours for patient simulations (nearly 3 minutes per conversation) and about 650 hours for expert evaluations (nearly 4 minutes per conversation). Using AI evaluators as a first line has the added advantage of eliminating the risk of exposing real patients to unverified AI tools.

The researchers said they expect that CRAFT-MD itself will also be updated and optimized periodically to integrate improved patient-AI models.

"As a physician scientist, I am interested in AI models that can augment clinical practice effectively and ethically," said study co-senior author Roxana Daneshjou, assistant professor of Biomedical Data Science and Dermatology at Stanford University. "CRAFT-MD creates a framework that more closely mirrors real-world interactions and thus it helps move the field forward when it comes to testing AI model performance in health care."

Authorship, funding, disclosures

Additional authors included Jaehwan Jeong and Hong-Yu Zhou, Harvard Medical School; Benjamin A. Tran, Georgetown University; Daniel I. Schlessinger, Northwestern University; Shannon Wongvibulsin, University of California-Los Angeles; Leandra A. Barnes, Zhuo Ran Cai and David Kim, Stanford University; and Eliezer M. Van Allen, Dana-Farber Cancer Institute.

The work was supported by the HMS Dean's Innovation Award and a Microsoft Accelerate Foundation Models Research grant awarded to Pranav Rajpurkar. SJ received further support through the IIE Quad Fellowship.
Daneshjou reported receiving personal fees from DWA, Pfizer, L'Oreal, and VisualDx, stock options from MDAlgorithms and Revea outside the submitted work, and a patent for TrueImage pending. Schlessinger is the co-founder of FixMySkin Healing Balms, a shareholder in Appiell Inc. and K-Health, a consultant with Appiell Inc. and LuminDx, and an investigator for AbbVie and Sanofi. Van Allen serves as an advisor to Enara Bio, Manifold Bio, Monte Rosa, Novartis Institute for Biomedical Research, and Serinus Bio; receives research support from Novartis, BMS, Sanofi, and NextPoint; and holds equity in Tango Therapeutics, Genome Medical, Genomic Life, Enara Bio, Manifold Bio, Microsoft, Monte Rosa, Riva Therapeutics, Serinus Bio, and Syapse. Van Allen has filed for institutional patents on chromatin mutations and immunotherapy response and on methods for clinical interpretation, provides intermittent legal consulting on patents for Foley & Hoag, and serves on the editorial board of Science Advances.
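To make the evaluation-time comparison in the article above concrete, the per-conversation estimates it quotes can simply be multiplied out. The short Python sketch below is illustrative only and assumes nothing beyond the reported figures (10,000 conversations, roughly 3 minutes per simulated conversation, and roughly 4 minutes per expert evaluation).

```python
# Illustrative check of the evaluation-time figures quoted above.
# The per-conversation minutes come from the article; the rest is arithmetic.

N_CONVERSATIONS = 10_000

def total_hours(minutes_per_conversation: float) -> float:
    """Total evaluator hours at a given per-conversation cost."""
    return N_CONVERSATIONS * minutes_per_conversation / 60

human_patient_simulation = total_hours(3.0)  # "nearly 3 minutes per conversation"
human_expert_evaluation = total_hours(3.9)   # "nearly 4 minutes per conversation"

print(f"Human patient simulation: ~{human_patient_simulation:.0f} hours")  # ~500 hours
print(f"Human expert evaluation:  ~{human_expert_evaluation:.0f} hours")   # ~650 hours

# By comparison, the automated pipeline processed 10,000 conversations in
# 48-72 hours and required only 15-16 hours of expert review.
```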
[2]
AI models struggle in real-world medical conversations
Harvard Medical School | Jan 2, 2025

Artificial intelligence tools such as ChatGPT have been touted for their promise to alleviate clinician workload by triaging patients, taking medical histories and even providing preliminary diagnoses. These tools, known as large language models, are already being used by patients to make sense of their symptoms and medical test results.

But while these AI models perform impressively on standardized medical tests, how well do they fare in situations that more closely mimic the real world? Not that great, according to the findings of a new study led by researchers at Harvard Medical School and Stanford University.

For their analysis, published Jan. 2 in Nature Medicine, the researchers designed an evaluation framework -- or a test -- called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) and deployed it on four large language models to see how well they performed in settings closely mimicking actual interactions with patients.

All four large language models did well on medical exam-style questions, but their performance worsened when engaged in conversations more closely mimicking real-world interactions. This gap, the researchers said, underscores a two-fold need: first, to create more realistic evaluations that better gauge the fitness of clinical AI models for use in the real world and, second, to improve the ability of these tools to make diagnoses based on more realistic interactions before they are deployed in the clinic. Evaluation tools like CRAFT-MD, the research team said, can not only assess AI models more accurately for real-world fitness but could also help optimize their performance in the clinic.

"Our work reveals a striking paradox -- while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor's visit. The dynamic nature of medical conversations -- the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms -- poses unique challenges that go far beyond answering multiple-choice questions. When we switch from standardized tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy."

Pranav Rajpurkar, study senior author, assistant professor of biomedical informatics at Harvard Medical School

A better test to check AI's real-world performance

Right now, developers test the performance of AI models by asking them to answer multiple-choice medical questions, typically derived from the national exam for graduating medical students or from tests given to medical residents as part of their certification.

"This approach assumes that all relevant information is presented clearly and concisely, often with medical terminology or buzzwords that simplify the diagnostic process, but in the real world this process is far messier," said study co-first author Shreya Johri, a doctoral student in the Rajpurkar Lab at Harvard Medical School. "We need a testing framework that reflects reality better and is, therefore, better at predicting how well a model would perform."

CRAFT-MD was designed to be one such more realistic gauge. To simulate real-world interactions, CRAFT-MD evaluates how well large language models can collect information about symptoms, medications, and family history and then make a diagnosis. An AI agent is used to pose as a patient, answering questions in a conversational, natural style.
Another AI agent grades the accuracy of the final diagnosis rendered by the large language model. Human experts then evaluate the outcomes of each encounter for the ability to gather relevant patient information, diagnostic accuracy when presented with scattered information, and adherence to prompts.

The researchers used CRAFT-MD to test four AI models -- both proprietary (commercial) and open-source ones -- for performance in 2,000 clinical vignettes featuring conditions common in primary care and across 12 medical specialties.

All AI models showed limitations, particularly in their ability to conduct clinical conversations and reason based on information given by patients. That, in turn, compromised their ability to take medical histories and render appropriate diagnoses. For example, the models often struggled to ask the right questions to gather pertinent patient history, missed critical information during history taking, and had difficulty synthesizing scattered information. The accuracy of these models declined when they were presented with open-ended information rather than multiple-choice answers. These models also performed worse when engaged in back-and-forth exchanges -- as most real-world conversations are -- than when engaged in summarized conversations.

Recommendations for optimizing AI's real-world performance

Based on these findings, the team offers a set of recommendations both for AI developers who design AI models and for regulators charged with evaluating and approving these tools. These include:

- Using conversational, open-ended questions that more accurately mirror unstructured doctor-patient interactions in the design, training, and testing of AI tools
- Assessing models for their ability to ask the right questions and to extract the most essential information
- Designing models capable of following multiple conversations and integrating information from them
- Designing AI models capable of integrating textual data (notes from conversations) with non-textual data (images, EKGs)
- Designing more sophisticated AI agents that can interpret non-verbal cues such as facial expressions, tone, and body language

Additionally, the evaluation should include both AI agents and human experts, the researchers recommend, because relying solely on human experts is labor-intensive and expensive. For example, CRAFT-MD outpaced human evaluators, processing 10,000 conversations in 48 to 72 hours, plus 15-16 hours of expert evaluation. In contrast, human-based approaches would require extensive recruitment and an estimated 500 hours for patient simulations (nearly 3 minutes per conversation) and about 650 hours for expert evaluations (nearly 4 minutes per conversation). Using AI evaluators as a first line has the added advantage of eliminating the risk of exposing real patients to unverified AI tools.

The researchers said they expect that CRAFT-MD itself will also be updated and optimized periodically to integrate improved patient-AI models.

"As a physician scientist, I am interested in AI models that can augment clinical practice effectively and ethically," said study co-senior author Roxana Daneshjou, assistant professor of Biomedical Data Science and Dermatology at Stanford University. "CRAFT-MD creates a framework that more closely mirrors real-world interactions and thus it helps move the field forward when it comes to testing AI model performance in health care."

Source: Harvard Medical School

Journal reference: Johri, S., et al. (2025). An evaluation framework for clinical use of large language models in patient interaction tasks. Nature Medicine. doi.org/10.1038/s41591-024-03328-5.
[3]
AI chatbots fail to diagnose patients by talking with them
Although popular AI models score highly on medical exams, their accuracy drops significantly when making a diagnosis based on a conversation with a simulated patient.

Advanced artificial intelligence models score well on professional medical exams but still flunk one of the most crucial physician tasks: talking with patients to gather relevant medical information and deliver an accurate diagnosis.

"While large language models show impressive results on multiple-choice tests, their accuracy drops significantly in dynamic conversations," says Pranav Rajpurkar at Harvard University. "The models particularly struggle with open-ended diagnostic reasoning."

That became evident when researchers developed a method for evaluating a clinical AI model's reasoning capabilities based on simulated doctor-patient conversations. The "patients" were based on 2000 medical cases primarily drawn from professional US medical board exams.

"Simulating patient interactions enables the evaluation of medical history-taking skills, a critical component of clinical practice that cannot be assessed using case vignettes," says Shreya Johri, also at Harvard University. The new evaluation benchmark, called CRAFT-MD, also "mirrors real-life scenarios, where patients may not know which details are crucial to share and may only disclose important information when prompted by specific questions", she says.

The CRAFT-MD benchmark itself relies on AI. OpenAI's GPT-4 model played the role of a "patient AI" in conversation with the "clinical AI" being tested. GPT-4 also helped grade the results by comparing the clinical AI's diagnosis with the correct answer for each case. Human medical experts double-checked these evaluations. They also reviewed the conversations to check the patient AI's accuracy and see if the clinical AI managed to gather the relevant medical information.

Multiple experiments showed that four leading large language models - OpenAI's GPT-3.5 and GPT-4 models, Meta's Llama-2-7b model and Mistral AI's Mistral-v2-7b model - performed considerably worse on the conversation-based benchmark than they did when making diagnoses based on written summaries of the cases. OpenAI, Meta and Mistral AI did not respond to requests for comment.

For example, GPT-4's diagnostic accuracy was an impressive 82 per cent when it was presented with structured case summaries and allowed to select the diagnosis from a multiple-choice list of answers, falling to just under 49 per cent when it did not have the multiple-choice options. When it had to make diagnoses from simulated patient conversations, however, its accuracy dropped to just 26 per cent.

And GPT-4 was the best-performing AI model tested in the study, with GPT-3.5 often coming in second, the Mistral AI model sometimes coming in second or third and Meta's Llama model generally scoring lowest.

The AI models also failed to gather complete medical histories a significant proportion of the time, with leading model GPT-4 only doing so in 71 per cent of simulated patient conversations. Even when the AI models did gather a patient's relevant medical history, they did not always produce the correct diagnoses.

Such simulated patient conversations represent a "far more useful" way to evaluate AI clinical reasoning capabilities than medical exams, says Eric Topol at the Scripps Research Translational Institute in California.
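The study's actual harness is not reproduced here, but the workflow described above (a patient agent role-plays a case, the clinical model conducts the interview, and a grader model compares the final diagnosis with the known answer, with human experts double-checking) maps onto a simple loop. The Python sketch below is a minimal illustration under those assumptions: the llm_chat helper, the prompts, and the turn limit are hypothetical stand-ins, not the study's implementation.

```python
# Minimal sketch of a CRAFT-MD-style conversational evaluation. llm_chat is a
# hypothetical placeholder for whatever chat-completion API is available;
# prompts and turn limits are illustrative only.

MAX_TURNS = 10  # cap on doctor-patient exchanges (illustrative)

def llm_chat(model: str, messages: list[dict]) -> str:
    """Placeholder for a real chat call (e.g. an OpenAI, Llama, or Mistral client)."""
    raise NotImplementedError

def run_encounter(case_vignette: str, clinical_model: str, patient_model: str = "gpt-4") -> str:
    """Let the clinical model interview a patient agent and return its final diagnosis."""
    patient_system = (
        "You are a patient. Answer the doctor's questions naturally and briefly, "
        "using only facts from this case description:\n" + case_vignette
    )
    doctor_system = (
        "You are a physician. Ask one question at a time to take a history. "
        "When ready, reply with 'FINAL DIAGNOSIS:' followed by your diagnosis."
    )
    doctor_msgs = [{"role": "system", "content": doctor_system}]
    patient_msgs = [{"role": "system", "content": patient_system}]

    for _ in range(MAX_TURNS):
        question = llm_chat(clinical_model, doctor_msgs)
        if "FINAL DIAGNOSIS:" in question:
            return question.split("FINAL DIAGNOSIS:", 1)[1].strip()
        doctor_msgs.append({"role": "assistant", "content": question})
        patient_msgs.append({"role": "user", "content": question})

        answer = llm_chat(patient_model, patient_msgs)
        patient_msgs.append({"role": "assistant", "content": answer})
        doctor_msgs.append({"role": "user", "content": answer})

    # Out of turns: ask the clinical model to commit to a diagnosis.
    doctor_msgs.append({"role": "user", "content": "Please give your FINAL DIAGNOSIS: now."})
    return llm_chat(clinical_model, doctor_msgs).split("FINAL DIAGNOSIS:")[-1].strip()

def grade(diagnosis: str, ground_truth: str, grader_model: str = "gpt-4") -> bool:
    """Ask a grader model whether the predicted diagnosis matches the known answer."""
    verdict = llm_chat(grader_model, [{
        "role": "user",
        "content": f"Ground truth: {ground_truth}\nPrediction: {diagnosis}\n"
                   "Do these refer to the same condition? Answer yes or no.",
    }])
    return verdict.strip().lower().startswith("yes")
```

In CRAFT-MD itself, human experts additionally audit the patient agent's answers and the grader's verdicts; that review layer is omitted from this sketch.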
If an AI model eventually passes this benchmark, consistently making accurate diagnoses based on simulated patient conversations, this would not necessarily make it superior to human physicians, says Rajpurkar. He points out that medical practice in the real world is "messier" than in simulations. It involves managing multiple patients, coordinating with healthcare teams, performing physical exams and understanding "complex social and systemic factors" in local healthcare situations. "Strong performance on our benchmark would suggest AI could be a powerful tool for supporting clinical work - but not necessarily a replacement for the holistic judgement of experienced physicians," says Rajpurkar.
A new study reveals that while AI models perform well on standardized medical tests, they face significant challenges in simulated real-world doctor-patient conversations, raising concerns about their readiness for clinical deployment.
A groundbreaking study led by researchers from Harvard Medical School and Stanford University has revealed a significant gap between the performance of AI models in standardized medical tests and their ability to handle real-world patient interactions. The research, published in Nature Medicine, introduces a new evaluation framework called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) designed to assess the capabilities of large language models in medical settings [1].
While AI tools like ChatGPT have shown promise in alleviating clinician workload through patient triage and preliminary diagnoses, the study exposes a striking paradox. Dr. Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School, notes, "While these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor's visit" [2].
The CRAFT-MD framework simulates real-world interactions by evaluating how well large language models can collect patient information and make diagnoses. It employs AI agents to pose as patients and to grade the accuracy of diagnoses, with human experts providing additional evaluation [1].
The study tested four AI models, including both proprietary and open-source versions, across 2,000 clinical vignettes. The results showed a significant drop in performance when models engaged in conversational, open-ended interactions compared to answering multiple-choice questions [3].
Key findings include:
- All four models handled exam-style, multiple-choice questions well but showed significant drops in diagnostic accuracy during simulated patient conversations [1].
- The models often struggled to ask the right questions, missed critical information during history taking, and had difficulty synthesizing scattered information [1].
- Accuracy declined when models were given open-ended information rather than multiple-choice answers, and declined further in back-and-forth exchanges compared with summarized conversations [2].
- GPT-4, the best-performing model tested, fell from 82 percent diagnostic accuracy on structured, multiple-choice case summaries to roughly 26 percent in simulated patient conversations, and gathered a complete medical history in only 71 percent of encounters [3] (see the short calculation after this list).
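To put the size of that drop in perspective, the GPT-4 figures reported in source [3] can be compared directly. The snippet below is purely illustrative and uses only those reported values.

```python
# GPT-4 diagnostic accuracy as reported in source [3] (percent).
multiple_choice = 82.0  # structured case summary with multiple-choice options
free_response = 49.0    # structured case summary, no answer options ("just under 49")
conversation = 26.0     # diagnosis from a simulated patient conversation

absolute_drop = multiple_choice - conversation         # 56 percentage points
relative_drop = absolute_drop / multiple_choice * 100  # ~68% below the multiple-choice score

print(f"Absolute drop: {absolute_drop:.0f} percentage points")
print(f"Relative drop: about {relative_drop:.0f}% below the multiple-choice accuracy")
```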
The research team offers several recommendations for AI developers and regulators:
- Use conversational, open-ended questions that mirror unstructured doctor-patient interactions in the design, training, and testing of AI tools.
- Assess models on their ability to ask the right questions and to extract the most essential information.
- Design models capable of following multiple conversations and integrating information from them.
- Design models that can combine textual data (notes from conversations) with non-textual data (images, EKGs).
- Develop more sophisticated AI agents that can interpret non-verbal cues such as facial expressions, tone, and body language.
- Include both AI agents and human experts in evaluations, since relying solely on human experts is labor-intensive and expensive [2].
While the study highlights current limitations, it also paves the way for more robust AI tools in healthcare. Dr. Rajpurkar emphasizes that even if AI models improve, they would likely serve as powerful support tools rather than replacements for experienced physicians [3].
Reference
[1] How good are AI doctors at medical conversations?
[2] AI models struggle in real-world medical conversations
[3] AI chatbots fail to diagnose patients by talking with them
A recent study reveals that ChatGPT, when used alone, significantly outperformed both human doctors and doctors using AI assistance in diagnosing medical conditions, raising questions about the future of AI in healthcare.
6 Sources
A collaborative research study explores the effectiveness of GPT-4 in assisting physicians with patient diagnosis, highlighting both the potential and limitations of AI in healthcare.
3 Sources
A new study reveals that AI-powered chatbots can improve physicians' clinical management reasoning, outperforming doctors using conventional resources and matching the performance of standalone AI in complex medical decision-making scenarios.
3 Sources
A new study from UC San Francisco shows that AI models like ChatGPT are not yet ready to make critical decisions in emergency rooms, tending to overprescribe treatments and admissions compared to human doctors.
5 Sources
Recent studies highlight the potential of artificial intelligence in medical settings, demonstrating improved diagnostic accuracy and decision-making. However, researchers caution about the need for careful implementation and human oversight.
2 Sources