AI Chatbots Give Misleading Medical Advice Half the Time, Multiple Studies Reveal

Reviewed by Nidhi Govil

Multiple studies reveal AI chatbots deliver problematic health advice 50% of the time, with some platforms fabricating medical references at rates up to 34%. As one in three Americans now turn to AI for health information, researchers warn these tools lack clinical judgment and produce authoritative-sounding but potentially dangerous responses, especially when patient data is incomplete.

AI Chatbots Deliver Problematic Health Advice at Alarming Rates

As one in three Americans turn to AI chatbots for health information, multiple studies reveal a troubling pattern: these tools deliver misleading medical advice at rates that should concern anyone using them. A study published in BMJ Open found that five popular platforms (ChatGPT, Gemini, Meta AI, Grok, and DeepSeek) produced problematic health advice in approximately 50% of cases, with nearly 20% of responses deemed highly problematic [2]. The evaluation involved 250 prompts across five misinformation-prone categories: cancer, vaccines, stem cells, nutrition, and athletic performance [5].

The chatbots performed relatively better on closed-ended prompts and on questions about vaccines and cancer, but struggled significantly with open-ended prompts and with domains like nutrition. Open-ended questions generated 40 highly problematic responses, compared with just 9 for closed-ended prompts [5]. Critically, these large language models (LLMs) delivered answers with confidence and certainty despite their flaws, creating a dangerous illusion of reliability that could compromise patient safety.

Fake Medical References Undermine Trust and Verification

Beyond inaccurate advice, AI chatbots fabricate medical references at concerning rates. Research published in The Annals of the Royal College of Surgeons of England examined nine AI platforms and discovered hallucination rates ranging from zero to 34% for AI-generated references [3]. Grok 3 performed worst, with 34% of its references fabricated or unverifiable, while DeepSeek DeepThink followed at 25%. Only five of the nine models tested produced no hallucinated references at all.

The most concerning fake medical references "closely resembled legitimate scientific literature," featuring plausible article titles, invented URLs, and attributions to reputable institutions like the Mayo Clinic [3]. This sophisticated fabrication undermines users' ability to verify whether information is accurate or evidence-based. No chatbot in the BMJ Open study produced a fully complete and accurate reference list in response to any prompt [2]. Additionally, many cited sources were behind academic paywalls, further limiting verification; Google Gemini stood out by providing all open-access, directly clickable sources [3].
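Readers can spot many fabricated citations themselves by checking titles against a bibliographic index. The snippet below is a minimal sketch, assuming the public Crossref REST API (api.crossref.org) and a hypothetical chatbot-supplied title; it is an illustration, not a tool from any of the studies above.

```python
"""Minimal sketch: check whether an AI-cited paper actually exists.

Queries the public Crossref index by title and compares the top hits.
The citation below is hypothetical, used only for illustration.
"""
import requests
from difflib import SequenceMatcher


def find_in_crossref(cited_title: str, threshold: float = 0.9) -> dict | None:
    """Return the closest Crossref match for a cited title, or None if no
    indexed record is similar enough (a hint the reference may be fake)."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": cited_title, "rows": 5},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        title = (item.get("title") or [""])[0]
        similarity = SequenceMatcher(
            None, cited_title.lower(), title.lower()
        ).ratio()
        if similarity >= threshold:
            return {"title": title, "doi": item["DOI"],
                    "similarity": round(similarity, 2)}
    return None


if __name__ == "__main__":
    # Hypothetical chatbot-supplied citation to verify
    citation = "Long-term outcomes of stem cell therapy for knee osteoarthritis"
    match = find_in_crossref(citation)
    print(match or "No close match found; verify manually before trusting it.")
```

A missing match is a red flag to investigate rather than proof of fabrication, and a match only shows the paper exists, not that it supports the chatbot's claim.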

AI Chatbot Misdiagnosis Rates Exceed 80% With Incomplete Data

When it comes to diagnosis, AI faces even steeper challenges in health care. A study published in JAMA Network Open tested 21 LLMs using clinical vignettes and found that failure rates exceeded 80% for all models when performing differential diagnosis with incomplete patient information [4]. Models from OpenAI, Anthropic, Google, xAI, and DeepSeek struggled particularly at the open-ended start of cases, when limited data was available. "These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information," said lead author Arya Rao [4].

Failure rates dropped below 40% for final diagnoses with complete data, with top performers exceeding 90% accuracy [4]. However, this highlights a critical limitation: real-world users often input vague or patchy information, precisely the scenario where these tools fail most dramatically. A February study in Nature Medicine involving nearly 1,300 participants found that when researchers provided specific medical scenarios, LLMs correctly identified conditions 95% of the time. But when participants used their own prompts for the same scenarios, accuracy plummeted to just one-third of cases [1]. "People don't know what they are supposed to be telling the model," explained lead author Andrew Bean of Oxford University [1].

Hospitals Deploy Branded Chatbots Despite Evidence Gap

Despite mounting evidence of medical misinformation risks, health systems are rolling out their own branded AI chatbots. K Health is partnering with Hartford HealthCare in Connecticut to deploy its PatientGPT chatbot to tens of thousands of existing patients [1]. CEO Allon Bloch frames this as meeting patients where they are: "Demand is accelerating, and patients are already using AI to navigate their lives" [1].

Yet experts question whether sufficient evidence supports these deployments. Adam Rodman, a clinical reasoning researcher at Beth Israel Deaconess Medical Center, told STAT News there isn't yet an evidence base showing that integrating chatbots into health systems improves patient outcomes. "We're not there yet," he said [1]. Concerns extend to the adequacy of monitoring, liability frameworks, and whether chatbots address the care problems patients actually face. A KFF poll found that among Americans using AI for health queries, 19% cited an inability to afford care and 18% lacked a regular provider or couldn't get appointments [1].

What This Means for AI and Healthcare's Future

The explosion of AI chatbots in healthcare occurs against a backdrop of systemic failure. Nearly one-third of Americans, more than 100 million people, lack a primary care provider [1]. OpenAI reports that more than 200 million people ask ChatGPT health and wellness questions weekly [2], while the KFF poll revealed that 41% of AI users have uploaded personal medical information such as test results [1].

These tools lack the clinical judgment essential for safe medical guidance. Because LLMs generate responses by predicting language patterns rather than retrieving verified facts, they have no built-in mechanism for factual verification [3]. Tim Mitchell, president of the Royal College of Surgeons of England, emphasized that "the excitement around using AI-generated information must be matched with caution, by both patients and doctors" [3]. The BMJ Open study authors warned that without public education and oversight, chatbots risk amplifying misinformation through "authoritative-sounding but potentially flawed responses" [2].
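To see that mechanism concretely, consider the toy sketch below. It uses the small, openly available GPT-2 model purely as a stand-in for larger chat systems (an assumption for illustration, not one of the models studied above) and prints the probabilities the model assigns to the next word of a medical sentence. Nothing in the loop consults a fact source; continuations are ranked only by linguistic plausibility.

```python
# Toy illustration: a causal LLM ranks next tokens by plausibility,
# with no step that checks any emitted claim against a fact source.
# GPT-2 is used here only as a small public stand-in for chat models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The recommended daily dose of vitamin D for adults is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token only
probs = torch.softmax(logits, dim=-1)

# Top candidate continuations: fluent, confident, and entirely unchecked.
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {float(p):.3f}")
```

Whatever number or unit scores highest is what gets said, which is why a model can assert a wrong dose with exactly the same fluency as a right one.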

Watch for regulatory responses addressing liability and transparency requirements. User caution remains essential: verify AI advice with licensed professionals, recognize that premium models may outperform free versions, and understand that confident-sounding answers don't guarantee accuracy. As specialized medical LLMs like Google's AMIE emerge, their real-world testing with actual patients, particularly in settings with limited doctor access, will determine whether AI can safely supplement rather than substitute for human clinical expertise.
