AI chatbots fail to improve medical advice for patients, Oxford study reveals

Reviewed by Nidhi Govil


A University of Oxford study published in Nature Medicine found that AI chatbots offer no advantage over internet searches when patients seek medical advice. Although the large language models achieved 94.9% accuracy in controlled tests, real-world human-AI interaction in healthcare revealed a troubling gap between AI's potential and its performance: patients struggled to provide complete information and received inconsistent advice.

AI Chatbots Perform No Better Than Traditional Methods for Patients Seeking Medical Advice

A comprehensive study from the University of Oxford has revealed that AI medical advice provides no measurable benefit to patients compared to traditional methods like internet searches. Published in Nature Medicine [1, 5], the research examined how 1,298 UK participants assessed health conditions across ten medical scenarios ranging from common colds to life-threatening brain hemorrhages. Researchers from the Oxford Internet Institute and Nuffield Department of Primary Care Health Sciences partnered with MLCommons to evaluate whether large language models including GPT-4o, Llama 3, and Command R+ could help people make better health decisions [1].

Source: 404 Media

The findings challenge the growing trend of relying on AI chatbots for health guidance. Mental Health UK polling from November 2025 found that more than one in three UK residents now use AI to support their mental health or wellbeing [4]. Yet this study suggests such reliance may be misplaced: participants who used AI showed no improvement over those who used internet searches in identifying relevant conditions or recommending appropriate courses of action.

The Troubling Gap Between AI Potential and Performance

When tested without human participants, the three large language models demonstrated impressive capabilities, identifying conditions correctly in 94.9% of cases and selecting the appropriate course of action in 56.3% of cases [2, 5]. However, when real people interacted with these systems, performance collapsed dramatically. Relevant conditions were identified in less than 34.5% of cases, and the correct course of action was given in less than 44.2% of interactions, no better than the control group using traditional resources [5].

Source: Euronews

Adam Mahdi, associate professor at Oxford and co-author of the paper, described this as a "huge gap" between the potential of AI and the pitfalls when it is used by people. "The knowledge may be in those bots; however, this knowledge doesn't always translate when interacting with humans," he explained [2]. Human-AI interaction in healthcare proved far more complex than benchmark testing suggested, revealing limitations that controlled experiments failed to capture.

Incomplete Information and Misleading Responses Create Dangerous Scenarios

The study identified two critical problems: humans providing incomplete information and AI chatbots generating misleading responses. When researchers analyzed around 30 interactions in detail, they found that patients often failed to share complete symptom details, leaving out crucial information [5]. "People share information gradually. They leave things out, they don't mention everything," Mahdi told the BBC [4].

Even more concerning, the systems delivered inaccurate medical advice that could endanger lives. In one documented case, two users described nearly identical symptoms of a subarachnoid hemorrhage, a life-threatening bleed on the brain. One patient who mentioned the "worst headache ever" was correctly advised to seek emergency care, while another who described a "terrible" headache was told to lie down in a darkened room [2, 5]. The models also provided geographically confused guidance, recommending partial US phone numbers alongside "Triple Zero," the Australian emergency number [1].

AI and Medical Misinformation Spread Through Authoritative-Sounding Sources

A separate study published in The Lancet Digital Health adds another layer of concern about AI and medical misinformation. Researchers at Mount Sinai tested 20 large language models and found they were more likely to propagate incorrect medical advice when misinformation came from authoritative-sounding sources [3]. When false information appeared in realistic hospital discharge notes, AI tools believed and passed it along 47% of the time, compared to just 9% for misinformation from social media platforms like Reddit [3].

"Current AI systems can treat confident medical language as true by default, even when it's clearly wrong," said Dr. Eyal Klang of the Icahn School of Medicine at Mount Sinai

3

. User prompts also affected accuracy, with authoritative-sounding questions increasing the likelihood that AI would agree with false information. Overall, the AI models believed fabricated information from roughly 32% of content sources, though OpenAI's GPT models proved least susceptible while other models accepted up to 63.6% of false claims

3

.

Benchmark Testing Fails to Reflect Real-World Medical Decision-Making

The Oxford research highlights a fundamental problem with how AI systems are evaluated for healthcare applications. Models trained on medical textbooks and clinical notes may excel at structured medical licensing exams, but this performance doesn't translate to real-world medical decision-making [1]. "Training AI models on medical textbooks and clinical notes can improve their performance on medical exams, but this is very different from practicing medicine," explained Luc Rocher, associate professor at the Oxford Internet Institute [1].

Source: France 24

Doctors spend years developing triage skills using rule-based protocols designed to minimize errors; AI systems lack that experience despite their vast knowledge bases. Lead author Andrew Bean noted that the analysis illustrated how human-AI interaction poses challenges "even for top" AI models [4]. Dr. Rebecca Payne, lead medical practitioner on the study, warned it could be "dangerous" for people to ask chatbots about their symptoms [4].

Healthcare Safeguards Needed Before AI Deployment

The researchers concluded that AI chatbots aren't ready for real-world use in helping patients assess health conditions. "Despite strong performance on medical benchmarks, providing people with current generations of LLMs does not appear to improve their understanding of medical information," the study states [1]. Rocher warned that as more people rely on chatbots for medical advice, "we risk flooding already strained hospitals with incorrect but plausible diagnoses" [1].

Dr. Girish Nadkarni, chief AI officer of Mount Sinai Health System, emphasized the need for built-in healthcare safeguards: "AI has the potential to be a real help for clinicians and patients, offering faster insights and support. But it needs built-in safeguards that check medical claims before they are presented as fact" [3]. Dr. Bertalan Meskó, editor of The Medical Futurist, noted that OpenAI and Anthropic recently released health-dedicated versions of their chatbots, which may yield different results, but stressed the need for "clear national regulations, regulatory guardrails and medical guidelines" [4].

The Oxford team plans similar studies across different countries, languages, and time periods to assess whether these factors affect how well AI helps people evaluate health conditions [2]. For now, the message is clear: AI medical advice requires substantial improvements before it can safely assist the public with healthcare decisions.
