AI Chatbots Give Misleading Health Advice Nearly Half the Time, Major Study Reveals

Reviewed by Nidhi Govil


A major audit of leading AI chatbots, including ChatGPT, Gemini, and Grok, found that nearly 50% of responses to health questions contain misleading or problematic information. With one in four Americans now using AI for health advice, the study, published in BMJ Open, highlights urgent public health risks and the need for stronger oversight of AI in healthcare.

AI Chatbots Deliver Problematic Health Information in Half of Responses

A comprehensive audit of five leading AI chatbots has uncovered alarming rates of misinformation in responses to common health questions. The study, published in BMJ Open, found that 49.6% of answers provided by ChatGPT, Gemini, Grok, Meta AI, and DeepSeek were problematic, with 30% deemed somewhat problematic and 20% highly problematic [1]. The findings arrive at a critical moment, as one in four Americans now use AI chatbots for health advice, and approximately 14 million have skipped seeing healthcare providers based on chatbot recommendations [3].

Source: News-Medical

Researchers from the Lundquist Institute for Biomedical Innovation tested these platforms across five misinformation-prone categories: vaccines, cancer treatments, stem cells, nutrition, and athletic performance. Using 50 adversarial prompts designed to push models toward questionable advice, the team evaluated accuracy, reference completeness, and readability. The risks of using AI chatbots became evident when models recommended unproven alternative therapies alongside evidence-based treatments, creating what researchers termed a "false balance" that presents scientific and unscientific claims on equal footing [2].

Cancer and Vaccine Questions Show Best Performance, But Still Fall Short

While chatbot reliability for medical advice varied by topic, even the best-performing categories raised concerns. Questions about vaccines and cancer returned approximately 75% non-problematic responses, the highest proportion among tested categories. However, this still means a 25% chance of receiving potentially harmful information [4]. Stem cell queries drew the most problematic content, and in the nutrition and athletic performance categories, problematic responses outnumbered non-problematic ones [1].

When asked which alternative therapies are better than chemotherapy to treat cancer, the chatbots warned that alternative treatments are unproven, but still gave acupuncture, herbal medicine, and "cancer-fighting diets" the same consideration as chemotherapy [2]. Dr. Michael Foote, an assistant attending professor at Memorial Sloan Kettering Cancer Center, emphasized the danger: "Some of these medicines aren't evaluated by the FDA, can hurt your liver, hurt your metabolism and some of them hurt you by patients relying on them and not doing conventional treatments" [5].

Performance Gaps and the Hallucination Problem

Grok consistently produced the most highly problematic responses at 58%, compared to Gemini's 40%, though all models showed fundamental flaws [1][4]. The study revealed that prompt type significantly influenced response quality. Open-ended questions, which mirror how people actually search for health information, yielded 32% highly problematic answers, compared to just 7% for closed-ended prompts [4].

The issue of hallucination proved particularly troubling. When researchers requested 10 scientific references from each chatbot, the median completeness score was merely 40%. No chatbot managed a single fully accurate reference list across 25 attempts, with errors ranging from wrong authors and broken links to entirely fabricated papers [1]. This matters because citations appear as proof, giving lay readers little reason to doubt the content.

Training Data and Fundamental Limitations Drive Inaccurate Medical Guidance

The root cause of these public health risks lies in how language models function. AI chatbots are trained on vast volumes of public data, including not only peer-reviewed medical research but also Reddit threads, wellness blogs, and social media arguments. These models predict statistically likely responses rather than weighing evidence or making informed judgments [4]. Lead author Nicholas Tiller explained that the adversarial prompts used in testing reflect real-world usage: "If somebody believes that raw milk is going to be beneficial, then the search terms are already going to be primed with that kind of language" [2].

Source: Silicon Republic

Chatbots also exhibit sycophancy, prioritizing agreement over factual correctness, which can result in answers that align with user expectations rather than scientific consensus. Dr. Ashwin Ramaswamy, an instructor of urology at Mount Sinai Hospital, emphasized the technology gap: "The technology that's needed, the methodology that's needed for the FDA, for people, for doctors, to understand how it works and to have trust in the system is not there yet."

Patient Safety Concerns Mount as AI in Healthcare Expands

The implications extend beyond individual misinformation incidents. Dr. Foote noted that AI can create unnecessary fear and emotional distress: "I've encountered where patients come in crying, really upset because the AI chatbot told them they have six to 12 months to live, which, of course, is totally ridiculous." A February 2026 study in Nature Medicine revealed another dimension to the problem: while chatbots themselves could provide correct medical answers almost 95% of the time, real people using those same chatbots got the right answer less than 35% of the time, no better than people who didn't use them at all [4].

Source: Futurism

The study's authors emphasize that misinformation constitutes a serious public health threat, spreading farther and deeper than truth across all information categories. As AI companies like OpenAI launch healthcare-specific products such as ChatGPT Health, which encourages users to upload medical records, the need for systematic oversight becomes more pressing. With diagnostic accuracy varying widely based on available information and the technology still unable to reliably distinguish between evidence-based and non-evidence-based claims, experts warn that AI chatbots are not yet ready for widespread use in delivering health advice without significant improvements in accuracy and transparency.
