4 Sources
[1]
AI chatbots misdiagnose in over 80% of early medical cases, study finds
Consumer AI chatbots falter when used to make medical diagnoses, particularly when faced with incomplete information, according to new research highlighting the risks of relying on them as digital doctors. The study finds that leading large language models struggle to suggest a range of possible diagnoses when patient data is limited, frequently narrowing too quickly to a single answer.

The results point to a broader limitation in AI: while chatbots can identify likely conditions once a case is fully specified, they are less reliable at the earlier, more uncertain stages of clinical reasoning. The findings highlight the dangers of relying on the technology alone to pinpoint health problems, particularly in cases where the data users input may be vague or patchy.

"These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information," said Arya Rao, the study's lead author and a researcher at the Massachusetts-based Mass General Brigham healthcare system.

The study, published in JAMA Network Open on Monday, tested AI models using 29 clinical vignettes based on a standard medical reference text. The experiment involved step-by-step disclosure of data, including the history of present illness, physical examination findings and laboratory results. The researchers posed diagnostic queries to the LLMs and measured their failure rates, defined as the proportion of questions not answered fully correctly.

The researchers evaluated 21 LLMs, including leading models by OpenAI, Anthropic, Google, xAI and DeepSeek. They found that failure rates exceeded 80 per cent for all models on so-called differential diagnosis, the early stage at which full patient information is lacking. The failure rates fell to less than 40 per cent for final diagnoses made with more complete data, with the best performers exceeding 90 per cent accuracy.

Claude is trained to direct people who ask medical questions to professionals, Anthropic said. Gemini is designed to do the same and has reminders built into its app to prompt users to double-check information, Google said. OpenAI's usage policy says its services should not be used to provide medical advice requiring a licence without appropriate professional involvement. xAI did not respond to a request for comment. DeepSeek could not be reached for comment.

Companies have been developing more specialised medical LLMs such as Google's Articulate Medical Intelligence Explorer (AMIE) and MedFound. Early results from evaluations of models such as AMIE were promising, said Sanjay Kinra, a clinical epidemiologist at the London School of Hygiene & Tropical Medicine. But they were unlikely to be able to match how doctors' clinical assessments "rely heavily on the look and feel of the patient", he added. "Nevertheless, they may have a role to play, particularly in situations or geographies in which access to doctors is limited," Kinra said. "So we urgently need research studies with actual patients from those settings."
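To make the step-by-step disclosure protocol concrete, here is a minimal sketch of what such an evaluation loop could look like. This is an illustration under assumptions, not the study's actual harness: the stage texts, the `ask_llm` helper, and the exact question wording are hypothetical placeholders.

```python
# A minimal sketch of a stepwise-disclosure evaluation loop, assuming a
# generic chat-completion interface. The stage texts, ask_llm helper, and
# question wording below are hypothetical placeholders, not the study's code.

STAGES = [
    ("history", "Initial presentation: a 64-year-old man reports chest pain ..."),
    ("exam", "Physical examination findings: ..."),
    ("labs", "Laboratory and imaging results: ..."),
]

def ask_llm(prompt: str) -> str:
    """Stand-in for a call to any LLM API; returns a canned reply here."""
    return "<model answer>"

def run_vignette() -> dict:
    """Reveal the case one stage at a time, querying the model after each."""
    context = ""
    answers = {}
    for stage, disclosure in STAGES:
        context += disclosure + "\n"
        if stage == "labs":  # full information is now on the table
            question = "Given all of the information, what is the final diagnosis?"
        else:                # early, information-poor stages
            question = "Given the information so far, list a differential diagnosis."
        answers[stage] = ask_llm(context + "\n" + question)
    return answers

if __name__ == "__main__":
    for stage, answer in run_vignette().items():
        print(f"{stage}: {answer}")
```

A failure rate in the article's sense would then be the share of such queries whose answers, graded against the vignette's reference answers, are not fully correct.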
[2]
Ready or Not, LLMs Are Coming for Medicine
This transcript has been edited for clarity.

Welcome to Impact Factor, your weekly dose of commentary on a new medical study. I'm Dr F. Perry Wilson from the Yale School of Medicine.

There's a new genre of medical papers in the "AI in medicine" space, and, like Mulder from The X-Files, I want to believe. The theme of these papers is something like "LLMs aren't going to replace good doctors," and that is very reassuring for this doctor, who definitely does not want to be replaced by a friendly AI. But if I'm honest with myself, my belief in human supremacy here is being shaken.

This week, a new paper appeared in JAMA Network Open evaluating the performance of various large language models on a diagnostic task. It was chock full of phrases like "our evaluation suggests that despite rapid advances in pattern recognition and knowledge retrieval, current LLMs still lack the reasoning processes needed for safe clinical use." That's reassuring. It also states that "the promise of LLMs in clinical medicine lies in their potential to augment -- not replace -- physician reasoning." Perfect. I'm happy to be augmented. I would rather not be replaced. But then I read through the study and, frankly, I'm more worried now than ever. Let me break it down for you.

Researchers evaluated 21 off-the-shelf large language models, including all the major players (ChatGPT, Claude, DeepSeek, Grok, Gemini), across 29 clinical vignettes from the MSD manual. The key innovation of this study over previous evaluations is how these vignettes are structured. Other studies evaluating LLMs often present an entire case and simply ask "what's the diagnosis?", but that's not how these clinical vignettes work. Instead, they develop iteratively. You get an initial presentation and then formulate a differential diagnosis, choose the appropriate tests, get those results, refine your differential, and so on until you arrive at the final diagnosis. This mirrors how real medicine works. No one ever shows up in the ER with a complete history of present illness, labs, and imaging all bundled together.

In the end, you still have to give a final diagnosis, of course, and the models did really well on this -- getting it right more than 90% of the time. I'll show you the accuracy here; it looks like DeepSeek edged out the others, but these are all pretty darn close and almost always hit the mark.

One of the major limitations is that the authors don't include any human comparison data in their study. I searched the literature for these MSD vignettes to find out how a human doctor would do on them, and to my great surprise every study I found was actually a study of various chatbot performances. Has anyone ever tried to see how a human does on these? Is 95% accuracy good? It seems good. For what it's worth, I looked at several of the cases -- they are publicly available. They're not easy or straightforward. I'm not saying I'm a master diagnostician or anything, but there's no way I would get 95% correct in terms of the final diagnosis.

The authors didn't focus solely on that metric, though, as important as it is. Instead, accuracy of the final diagnosis is one element of a five-part scoring scheme that also includes accuracy of differential diagnosis, diagnostic testing, management, and miscellaneous clinical reasoning. These five metrics comprise a novel "PrIME-LLM" score, defined by the area of an irregular pentagon -- maxing out at 100% if the model performed perfectly on every single question. No model did.
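The articles describe the score only as a normalized pentagon area over five domain scores, so here is one plausible reconstruction: a minimal sketch assuming a standard radar-chart area computation, not the paper's published formula, with made-up domain scores.

```python
import math

# Hypothetical reconstruction of a pentagon-area composite score. The study
# is described only as taking "the area of an irregular pentagon" over five
# domain scores, so the formula below is an assumption: scores in [0, 1] are
# plotted on radar axes 72 degrees apart, and the enclosed area is
# normalized by the area of a perfect all-1.0 pentagon.

def pentagon_score(scores):
    """Normalized radar-chart area over five clinical-reasoning domains."""
    n = len(scores)
    assert n == 5, "one score per domain: differential, testing, final, management, misc."
    # The polygon decomposes into n triangles between adjacent axes, each
    # with area 0.5 * r_i * r_{i+1} * sin(2*pi/n).
    area = sum(
        0.5 * scores[i] * scores[(i + 1) % n] * math.sin(2 * math.pi / n)
        for i in range(n)
    )
    max_area = 0.5 * n * math.sin(2 * math.pi / n)  # all scores equal to 1.0
    return area / max_area

# Made-up domain scores: strong on final diagnosis, weak on the differential.
print(round(pentagon_score([0.15, 0.80, 0.95, 0.85, 0.75]), 2))  # -> 0.49
```

One property of any area-based composite like this: each domain score multiplies its two neighbors on the chart, so a single weak axis (say, differential diagnosis) shrinks two triangles at once, penalizing imbalance more than a simple mean would. That aligns with the authors' stated goal of not letting strength in one task mask weakness in another, though under this construction the ordering of the axes would also affect the score.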
PrIME-LLM scores varied between 0.64 at the low end (Gemini 1.5) and 0.78 at the peak (Grok 4, Gemini 3 Flash, Gemini 3 Pro). I cannot tell you whether that is good or not, because there are no human data against which to standardize.

But the paper is written, like many in this space, with what feels to me to be a clear anti-LLM bias. For example, the authors state that the failure rates in generating a differential diagnosis were anywhere from 90% to 100%. But we have to look at how they defined failure here. The vignettes are structured with part of the case presentation and then a question and a long list of potential items that could be on the differential diagnosis. To be "correct," the model needs to flag everything that could be on the differential and nothing that shouldn't be on the differential. They are clearly not good at this. But I'm not sure how good any of us would be at that.

In other ways, the models felt hamstrung to me. Models with reasoning capabilities had reasoning turned off, if that option was available. Models that could access the internet were not allowed to access the internet. I suppose this was to level the playing field, but it seems to me like these are some of the potential strengths of an AI diagnostician. On the other side of the coin, the authors acknowledge that they can't be sure these MSD questions weren't already part of the model training set, in which case all of these responses are suspect no matter what. Although if this were all just regurgitating training data, you would think the models would nail that highly tricky differential diagnosis question.

Basically, what we have here is a set of questions that, whatever the initial purpose of their design, have been turned into a standard LLM benchmark. We can thus compare LLMs to each other on this metric, but without standardized human data, we have no idea how they are performing compared to regular doctors. And that 95% accuracy in final diagnosis is nothing to sneeze at. I'd be surprised if any of us could do as well.

So, here's how I see the future unfolding. Soon, patients will demand that an AI agent, perhaps one trained specifically for the purpose, "reviews" the diagnosis and management plan made by the physician. Or, if patients won't, insurance companies will. The AI will, as the authors suggest, "augment" the physician's effort. And then AI agents will creep in around the edges. Perhaps an urgent care triage line will be staffed by AI before passing a patient on to a physician. Maybe an AI agent can perform an initial history, and even order some basic tests, while the patient is waiting in the ER waiting room, teeing them up for when the doctor is ready.

And then, at some point, a randomized trial will compare AI agents to humans directly in these spaces and show, probably, non-inferiority. The FDA will approve an initial agentic care provider, and then others through the 510(k) pathway. Some state will pass a law allowing AI agents to order tests or medications without physician input. There will still be doctors around, of course. But fewer, and in more oversight roles. I'm not sure what the timeline of this will be, but I think it's shorter than we expect. And when it happens, we will look back at papers like this, flagging how bad LLMs are at differential diagnosis, and ask ourselves how we missed so many obvious signs of which way the wind was blowing.
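On that "fully correct" criterion discussed above, a minimal sketch of how such all-or-nothing differential grading could look, assuming answers are reduced to sets of diagnosis labels (the labels and helper below are invented for illustration):

```python
# Hypothetical all-or-nothing grading of a differential diagnosis, matching
# the failure definition described above: every reference item must be
# flagged and nothing extra. The diagnosis labels are invented.

def differential_fully_correct(model_answer: set, reference: set) -> bool:
    """True only if the model's differential exactly matches the reference."""
    return model_answer == reference

reference = {"pulmonary embolism", "pneumonia", "acute coronary syndrome"}
model_answer = {"pulmonary embolism", "pneumonia"}  # one reference item missed

print(differential_fully_correct(model_answer, reference))  # False -> a "failure"
```

Under a rule like this, a single missed or extra item counts as a complete failure, which helps explain how failure rates on the differential step can approach 100% even for models that are otherwise strong.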
F. Perry Wilson, MD, MSCE, is an associate professor of medicine and public health and director of Yale's Clinical and Translational Research Accelerator. His science communication work can be found in the Huffington Post, on NPR, and here on Medscape. He posts at @fperrywilson, and his book, How Medicine Works and When It Doesn't, is available now.
[3]
Generative AI falls short in diagnostic reasoning despite accuracy
Mass General Brigham, Apr 13 2026

Despite increasing use of artificial intelligence (AI) in health care, a new study led by Mass General Brigham researchers from the MESH Incubator shows that generative AI models continue to fall short in their clinical reasoning capabilities. By asking 21 different large language models (LLMs) to play doctor in a series of clinical scenarios, the researchers showed that LLMs often fail at navigating diagnostic workups and coming up with a testable list of potential or "differential" diagnoses. Though all tested LLMs arrived at a correct final diagnosis more than 90% of the time when provided with all pertinent information in a patient case, they consistently performed poorly at the earlier, reasoning-driven steps of the diagnostic process, according to the results published in JAMA Network Open.

"Despite continued improvements, off-the-shelf large language models are not ready for unsupervised clinical-grade deployment. Differential diagnoses are central to clinical reasoning and underlie the 'art of medicine' that AI cannot currently replicate. The promise of AI in clinical medicine continues to lie in its potential to augment, not replace, physician reasoning, provided all the relevant data is available - not always the case," said Marc Succi, MD, corresponding author and executive director of the MESH Incubator at Mass General Brigham.

This new research is a follow-up to previous work led by Succi's MESH group in which researchers evaluated ChatGPT 3.5's ability to accurately diagnose a series of clinical vignettes. In the new study, the researchers developed a novel and more holistic measure of LLMs that looks beyond accuracy, called PrIME-LLM, which evaluates a model's competency across different stages of clinical reasoning: coming up with potential diagnoses, conducting appropriate tests, arriving at a final diagnosis, and managing treatment. When models perform well in one area but poorly in another, this imbalance is reflected in the PrIME-LLM score, as opposed to averaging competency across tasks, which may mask areas of weakness, according to the researchers.

The study compared 21 general-purpose LLMs, including the latest models of ChatGPT, DeepSeek, Claude, Gemini, and Grok at the time of submission. The researchers tested the models' ability to work through 29 published clinical cases. To simulate the way that clinical cases unfold, the researchers gradually fed the models information, beginning with basics like a patient's age, gender, and symptoms before adding physical examination findings and laboratory results. The LLMs' performance at each stage was assessed by medical student evaluators, and these evaluations were used to calculate the models' overall PrIME-LLM scores.

In line with their previous study, the researchers found that the LLMs were good at producing accurate final diagnoses. However, all of the models failed to produce an appropriate differential diagnosis more than 80% of the time. In the real world, a differential diagnosis is critical, but in this study, the models were given more information so that they could proceed to the next stage of the clinical workup even if they failed at the differential diagnosis step.

"By evaluating LLMs in a stepwise fashion, we move past treating them like test-takers and put them in the position of a doctor," said Arya Rao, lead author, MESH researcher, and MD-PhD student at Harvard Medical School.
"These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information." Most of the LLMs showed improved accuracy when provided with laboratory results and imaging in addition to text. More recently released models generally outperformed older models, showing that LLMs are improving incrementally. The models' PrIME-LLM scores ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5. According to Succi, PrIME-LLM represents a standardized way to evaluate AI's clinical competency that could be used by AI developers and hospital leaders to benchmark new technologies as they are released. "We want to help separate the hype from the reality of these tools as they apply to health care," he said. "Our results reinforce that large language models in healthcare continue to require a 'human in the loop' and very close oversight." Mass General Brigham Journal reference: Rao, A. S., et al. (2026). Large Language Model Performance and Clinical Reasoning Tasks. JAMA Network Open. DOI: 10.1001/jamanetworkopen.2026.4003. https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2847679
[4]
AI fails at primary diagnosis more than 80% of the time, study finds
Generative artificial intelligence (AI) still lacks the reasoning processes needed for safe clinical use, a new study has found. AI chatbots have improved their diagnostic accuracy when presented with comprehensive clinical information, but still failed to produce an appropriate differential diagnosis more than 80% of the time, according to researchers at Mass General Brigham, a Boston-based non-profit hospital and research network and one of the largest health systems in the United States.

The results of the study, published in the open-access medical journal JAMA Network Open, show that large language models (LLMs) fall short of the reasoning required for clinical use. "Despite continued improvements, off-the-shelf large language models are not ready for unsupervised clinical-grade deployment," said Marc Succi, co-author of the study. He added that AI cannot yet replicate differential diagnosis, which is central to clinical reasoning, and which he considers the "art of medicine". Differential diagnosis is the first step for healthcare professionals to identify a condition, separating it from others with similar symptoms.

The research team analysed the performance of 21 LLMs, including the latest available versions of Claude, DeepSeek, Gemini, GPT and Grok. They evaluated the LLMs on 29 standardised clinical vignettes using a newly developed tool called PrIME-LLM. The tool assesses a model's ability across different stages of clinical reasoning: forming an initial diagnosis, ordering appropriate tests, arriving at a final diagnosis, and planning treatment. To simulate how clinical cases unfold, the researchers gradually fed the models information, beginning with basics such as a patient's age, sex and symptoms, before adding physical examination findings and laboratory results. A differential diagnosis is critical in a real-world clinical setting to advance to the next step. In the study, however, the models were given additional information so that they could proceed to the next stage even if they failed at the differential diagnosis step.

The researchers found that the language models achieved high accuracy on final diagnoses but performed poorly in generating differential diagnoses and navigating uncertainty. Study author Arya Rao noted that evaluating LLMs in a stepwise fashion moves research past treating them like test-takers and puts them in a doctor's position. "These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information," she added.

The researchers found that all of the models failed to produce an appropriate differential diagnosis more than 80% of the time. On final diagnosis, success rates ranged from around 60% to over 90% depending on the model. Most of the LLMs showed improved accuracy when provided with laboratory results and imaging in addition to text. The results identified a top-performing cluster that included Grok 4, GPT-5, GPT-4.5, Claude 4.5 Opus, Gemini 3.0 Flash and Gemini 3.0 Pro. However, the authors noted that despite version-based improvements and advantages in reasoning-optimised models, off-the-shelf LLMs have not yet achieved the level of intelligence required for safe deployment and remain limited in demonstrating advanced clinical reasoning.

"Our results reinforce that large language models in healthcare continue to require a 'human in the loop' and very close oversight," Succi noted.
Susana Manso GarcΓa, a member of the Artificial Intelligence and Digital Health working group of the Spanish Society of Family and Community Medicine, who was not involved in the study, said the findings carry a clear message for the public. "The study itself insists they [language models] should not be used to make clinical decisions without supervision. Therefore, whilst artificial intelligence represents a promising tool, human clinical judgement remains indispensable," she said. "The recommendation for the public is to use these technologies with caution and, when faced with any health concern, always consult a healthcare professional."
A comprehensive study published in JAMA Network Open reveals that AI chatbots misdiagnose patients more than 80% of the time during initial assessments. While large language models from OpenAI, Google, and Anthropic excel at final diagnoses with complete data, they struggle significantly at the early stages of clinical reasoning when patient information is limited. The findings underscore the dangers of relying on AI chatbots for medical decisions without human oversight.
Consumer AI chatbots are failing the most critical test in medicine: the ability to think through uncertain, incomplete information. A new study from Mass General Brigham researchers has found that leading large language models struggle dramatically with AI diagnosis when faced with the messy reality of early patient presentations. Published in JAMA Network Open, the research evaluated 21 LLMs (including models from OpenAI, Google, Anthropic, xAI, and DeepSeek) across 29 clinical vignettes, revealing that AI chatbots misdiagnose patients more than 80% of the time during differential diagnosis stages [1].
The study's lead author, Arya Rao, a researcher at Mass General Brigham and MD-PhD student at Harvard Medical School, explained the fundamental problem: "These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information" [3]. This limitation exposes a critical gap in LLMs in medicine: while they excel at pattern recognition with complete datasets, their diagnostic reasoning capabilities falter precisely when doctors need them most.

To assess AI diagnostic accuracy more holistically, researchers developed a novel evaluation framework called PrIME-LLM that moves beyond simple right-or-wrong answers. This metric evaluates models across five dimensions: differential diagnosis generation, diagnostic testing selection, final diagnosis accuracy, treatment management, and miscellaneous clinical reasoning. The scores are represented as the area of an irregular pentagon, maxing out at 100% for perfect performance [2].
No model achieved perfection. PrIME-LLM scores ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5, with Gemini 3.0 Flash and Gemini 3.0 Pro also reaching the top tier [3]. The researchers deliberately structured their evaluation to mirror how real medicine unfolds, gradually feeding models patient data including age, gender, symptoms, physical examination findings, and laboratory results, rather than presenting complete cases all at once [4].

The paradox at the heart of this research is striking. When provided with comprehensive clinical information, all tested models achieved final diagnosis accuracy exceeding 90%, with some reaching over 95% [1]. Yet this impressive performance disappears at the early stages of clinical reasoning, where failure rates exceeded 80% across all models when generating differential diagnoses with incomplete patient data [1].

This discrepancy highlights the dangers of relying on AI chatbots for medical decisions. Differential diagnosis, the process of distinguishing one condition from others with similar symptoms, represents what Marc Succi, the study's corresponding author and executive director of the MESH Incubator at Mass General Brigham, calls the "art of medicine" [4]. Most LLMs showed improved accuracy when provided with laboratory results and imaging in addition to text, but their inability to navigate uncertainty in diagnostic scenarios remains a fundamental barrier to unsupervised clinical deployment [4].
Major AI companies have built safeguards into their products, though the study suggests these may not be sufficient. Claude is trained to direct people who ask medical questions to healthcare professionals, according to Anthropic. Google stated that Gemini is designed similarly and includes reminders prompting users to verify information. OpenAI's usage policy explicitly prohibits using its services to provide medical advice requiring a license without appropriate professional involvement [1].

Despite these precautions, the research reveals that off-the-shelf large language models are not ready for unsupervised clinical-grade deployment. Succi emphasized that "large language models in healthcare continue to require a 'human in the loop' and very close oversight" [3]. This human oversight requirement becomes particularly critical in situations where patient data may be vague or patchy: exactly the conditions under which people might turn to AI chatbots for quick answers.

The study arrives as companies develop more specialized medical LLMs, including Google's Articulate Medical Intelligence Explorer (AMIE) and MedFound. Sanjay Kinra, a clinical epidemiologist at the London School of Hygiene & Tropical Medicine, noted that early results from models like AMIE showed promise but acknowledged they're unlikely to match how doctors' clinical assessments "rely heavily on the look and feel of the patient" [1].

Yet Kinra also pointed to a potential role in resource-limited settings: "Nevertheless, they may have a role to play, particularly in situations or geographies in which access to doctors is limited. So we urgently need research studies with actual patients from those settings" [1]. This suggests that while current large language models fail at primary diagnosis in controlled testing environments, their real-world value may depend heavily on context and the availability of alternatives.

Susana Manso García, a member of the Artificial Intelligence and Digital Health working group of the Spanish Society of Family and Community Medicine, offered clear guidance for the public: "The recommendation for the public is to use these technologies with caution and, when faced with any health concern, always consult a healthcare professional" [4]. As AI continues advancing incrementally (newer models generally outperformed older ones in the study), the technology's promise lies in augmenting rather than replacing physician reasoning, provided all relevant data is available.

Summarized by Navi