AI Diagnosis Falls Short: Large Language Models Fail Clinical Reasoning in Over 80% of Early Cases

Reviewed by Nidhi Govil


A comprehensive study published in JAMA Network Open reveals that AI chatbots misdiagnose patients more than 80% of the time during initial assessments. While large language models from OpenAI, Google, and Anthropic excel at final diagnoses with complete data, they struggle significantly at the early stages of clinical reasoning when patient information is limited. The findings underscore the dangers of relying on AI chatbots for medical decisions without human oversight.

Large Language Models Stumble at Differential Diagnosis

Consumer AI chatbots are failing the most critical test in medicine: the ability to reason through uncertain, incomplete information. A new study from Mass General Brigham researchers has found that leading large language models struggle dramatically with diagnosis when faced with the messy reality of early patient presentations. Published in JAMA Network Open, the research evaluated 21 LLMs (including models from OpenAI, Google, Anthropic, xAI, and DeepSeek) across 29 clinical vignettes, revealing that AI chatbots misdiagnose patients more than 80% of the time during the differential diagnosis stage [1].

Source: Euronews

The study's lead author, Arya Rao, a researcher at Mass General Brigham and an MD-PhD student at Harvard Medical School, explained the fundamental problem: "These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information" [3]. This limitation exposes a critical gap for LLMs in medicine: while they excel at pattern recognition with complete datasets, their diagnostic reasoning falters precisely when doctors need them most.

PrIME-LLM Score Reveals Hidden Weaknesses

To assess AI diagnostic accuracy more holistically, the researchers developed a novel evaluation framework called PrIME-LLM that moves beyond simple right-or-wrong answers. The metric evaluates models across five dimensions: differential diagnosis generation, diagnostic test selection, final diagnosis accuracy, treatment management, and miscellaneous clinical reasoning. Scores are represented as the area of an irregular pentagon, maxing out at 100% for perfect performance [2].
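The study does not publish the exact geometry of the pentagon calculation, but a natural reading of "area of an irregular pentagon, maxing out at 100%" is a radar-chart area: plot each dimension score on one of five evenly spaced axes and normalize by the area of a perfect pentagon. A minimal sketch under that assumption (the function name `prime_llm_score` and the example score vector are hypothetical, not from the paper):

```python
import math

def prime_llm_score(scores):
    """Area of the irregular pentagon spanned by five dimension scores
    (each in [0, 1]) on a radar chart, normalized so that a model scoring
    1.0 on every dimension gets exactly 1.0."""
    assert len(scores) == 5
    theta = 2 * math.pi / 5  # angle between adjacent pentagon axes (72 degrees)
    # The pentagon decomposes into five triangles between adjacent axes,
    # each with area (1/2) * r_i * r_{i+1} * sin(theta).
    area = sum(
        0.5 * scores[i] * scores[(i + 1) % 5] * math.sin(theta)
        for i in range(5)
    )
    max_area = 5 * 0.5 * math.sin(theta)  # all five dimensions at 1.0
    return area / max_area

# Hypothetical profile: weak differential generation, strong final diagnosis.
print(prime_llm_score([0.2, 0.8, 0.95, 0.85, 0.7]))
```

One property of this area-based formulation is that a single weak dimension drags down both triangles it touches, so a model cannot hide an early-stage failure behind a high final-diagnosis score.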

Source: News-Medical

No model achieved perfection. PrIME-LLM scores ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5, with Gemini 3.0 Flash and Gemini 3.0 Pro also reaching the top tier [3]. The researchers deliberately structured their evaluation to mirror how real medicine unfolds, gradually feeding models patient data (age, gender, symptoms, physical examination findings, and laboratory results) rather than presenting complete cases all at once [4].

High Final Accuracy Masks Early-Stage Failures

The paradox at the heart of this research is striking. When provided with comprehensive clinical information, all tested models achieved final diagnosis accuracy exceeding 90%, with some surpassing 95% [1]. Yet this impressive performance disappears at the early stages of clinical reasoning, where failure rates exceeded 80% across all models when generating differential diagnoses from incomplete patient data [1].

This discrepancy highlights the dangers of relying on AI chatbots for medical decisions. Differential diagnosis, the process of distinguishing one condition from others with similar symptoms, represents what Marc Succi, the study's corresponding author and executive director of the MESH Incubator at Mass General Brigham, calls the "art of medicine" [4]. Most LLMs showed improved accuracy when given laboratory results and imaging in addition to text, but their inability to navigate uncertainty in diagnostic scenarios remains a fundamental barrier to unsupervised clinical deployment [4].

Industry Response and Safety Guardrails

Major AI companies have built safeguards into their products, though the study suggests these may not be sufficient. Claude is trained to direct people who ask medical questions to healthcare professionals, according to Anthropic. Google stated that Gemini is designed similarly and includes reminders prompting users to verify information. OpenAI's usage policy explicitly prohibits using its services to provide medical advice requiring a license without appropriate professional involvement [1].

Despite these precautions, the research reveals that off-the-shelf large language models are not ready for unsupervised, clinical-grade deployment. Succi emphasized that "large language models in healthcare continue to require a 'human in the loop' and very close oversight" [3]. This oversight requirement becomes particularly critical when patient data are vague or patchy, exactly the conditions under which people might turn to AI chatbots for quick answers.

Future Implications for Medical LLMs

The study arrives as companies develop more specialized medical LLMs, including Google's Articulate Medical Intelligence Explorer (AMIE) and MedFound. Sanjay Kinra, a clinical epidemiologist at the London School of Hygiene & Tropical Medicine, noted that early results from models like AMIE showed promise, but acknowledged they are unlikely to match how doctors' clinical assessments "rely heavily on the look and feel of the patient" [1].

Yet Kinra also pointed to a potential role in resource-limited settings: "Nevertheless, they may have a role to play, particularly in situations or geographies in which access to doctors is limited. So we urgently need research studies with actual patients from those settings" [1]. This suggests that while current large language models fail at primary diagnosis in controlled testing environments, their real-world value may depend heavily on context and the availability of alternatives.

Susana Manso García, a member of the Artificial Intelligence and Digital Health working group of the Spanish Society of Family and Community Medicine, offered clear guidance for the public: "The recommendation for the public is to use these technologies with caution and, when faced with any health concern, always consult a healthcare professional" [4]. As AI continues advancing incrementally (newer models generally outperformed older ones in the study), the technology's promise lies in augmenting rather than replacing physician reasoning, provided all relevant data are available.
