AI outperforms ER doctors in diagnostic accuracy, but collaboration remains the goal

Reviewed by Nidhi Govil


A groundbreaking study published in Science reveals that OpenAI's o1-preview reasoning model achieved 67.1% diagnostic accuracy on real emergency department cases, surpassing two expert physicians who scored 55.3% and 50%. Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center emphasize that while AI in medicine shows remarkable potential, the findings point toward collaborative care models rather than physician replacement.

AI Diagnostic Accuracy Reaches New Benchmark in Emergency Medicine

A landmark study published in Science demonstrates that AI in medicine has reached a significant milestone, with OpenAI's o1-preview model matching or exceeding physician-level clinical diagnostic reasoning on authentic medical cases [1]. Researchers led by Arjun Manrai from Harvard Medical School and Adam Rodman from Beth Israel Deaconess Medical Center tested the o1-preview model across six experiments, including 76 actual emergency department cases and 143 complex clinical vignettes published in The New England Journal of Medicine [2].

The results reveal striking advances in AI diagnostic accuracy. When evaluating real emergency room patients at triage, the AI achieved 67.1% exact or very-close diagnostic accuracy, while two expert attending physicians scored 55.3% and 50.0%, respectively [4]. Blinded physician reviewers could not distinguish the AI output from human diagnoses. On published clinical vignettes, the o1-preview model included the correct diagnosis in its differential in 78.3% of cases and suggested a helpful diagnosis in 97.9% of cases, vastly outperforming GPT-4, which achieved 72.9% accuracy [1].

Source: CNET


How Large Language Models Enable Clinical Diagnostic Reasoning

The OpenAI o1-preview model represents a new class of reasoning models: Large Language Models (LLMs) enhanced with the capability to work through complex problems step by step before responding, mirroring a clinician's structured thinking [1]. This deliberative approach proved particularly effective during early-stage triage, when decisions must be made with limited information. The model handled uncertainty far better than the human clinicians, making more effective use of fragmented or unstructured electronic health records and notes [2].

Rodman described a compelling case where a patient presented with routine respiratory symptoms after an organ transplant. The AI model suspected a dangerous flesh-eating infection from the very beginning, approximately 12 to 24 hours before human physicians would have become suspicious of this condition [3]. In another instance, when a pulmonary embolism patient's symptoms worsened despite treatment, the AI correctly identified lupus-related heart inflammation as the underlying cause by scanning medical records [5].

Source: Science News


Integrating AI into Healthcare: Collaboration Over Replacement

Despite the impressive performance, researchers emphasize that AI outperforms doctors in specific contexts but should not replace them. "I don't think our findings mean that AI replaces doctors, despite what some companies are likely to say," Manrai stated during a press briefing [2]. The prevailing proposal for AI in emergency medicine focuses on collaborative care models, with clinicians providing oversight, contextual judgment, and accountability [1].

Prior research using clinical vignettes found no substantial difference between physicians augmented with GPT-4 and GPT-4 working alone, though both outperformed physicians with only conventional resources [1]. This suggests that determining optimal implementation requires evaluating AI alone, clinician alone, and clinician with AI—a critical consideration as clinicians already integrate AI tools into practice, sometimes without institutional oversight [1].

Source: Mashable


Limitations and Preventing Missed Diagnoses

While the study establishes a foundation for authentic evaluation across text-based tasks, real clinical work relies heavily on visual and auditory cues from physical examinations [1]. The o1 models were limited to text-only input and currently underperform on most medical imaging benchmarks [4]. Newer multimodal AI systems like GPT-5.3 and Gemini 3.1 Pro can process text, images, audio, and video together, potentially enabling assessments that more closely mirror actual clinical diagnosis [1].

Separate research by Arya Rao at Harvard Medical School identified a persistent weak point in AI reasoning: weighing several uncertain diagnoses at once. LLM-based models tend to jump to conclusions, with reasoning that is "brittle precisely where uncertainty and nuance matter most" [3]. Concerns about AI hallucinations and patient safety also persist, with researchers noting that AI can spontaneously develop unexpected behaviors and provide incorrect information [4].

Urgent Need for Clinical Trials and Rigorous Standards

The findings indicate an urgent need to understand, through prospective clinical trials, how these tools can be safely integrated into clinical workflows [1]. "We're witnessing a really profound change in technology that will reshape medicine, and we need to evaluate this technology now, and rigorously, in prospective clinical trials," Manrai emphasized [2]. Regulators, hospitals, and healthcare providers must work together to test these tools thoroughly before deployment to ensure patient safety and health equity [2].

Researchers at Flinders University wrote in a concurrent Science commentary that "we do not allow doctors to practice without supervision and evaluation, and AI should be held to comparable standards" [2]. As of 2025, 1 in 5 doctors and nurses worldwide used AI for a second opinion on complex cases, with over half wanting to use it for this purpose [3]. With such widespread interest, establishing decision support systems that balance AI capabilities with human expertise becomes critical for the future of medicine.


© 2026 TheOutpost.AI All rights reserved