ChatGPT Health fails critical emergency safety tests, raising concerns for 40 million users

Reviewed by Nidhi Govil


A Mount Sinai study published in Nature Medicine found that ChatGPT Health, used by 40 million people daily, failed to direct users appropriately to emergency care in more than half of serious cases. The safety evaluation also revealed alarming inconsistencies in suicide-crisis safeguards, with alerts appearing less reliably when users described specific self-harm plans.

ChatGPT Health Under-Triaged Emergency Cases in First Independent Safety Evaluation

ChatGPT Health, the consumer AI tool providing health guidance to approximately 40 million daily users, fails critical emergency safety tests and shows dangerous inconsistencies in suicide-crisis safeguards, according to a study published February 23, 2026, in Nature Medicine [1][2]. Researchers at the Icahn School of Medicine at Mount Sinai Health System conducted the first independent safety evaluation of the large language model (LLM)-based tool since OpenAI launched it in January 2026. The findings reveal significant blind spots in AI medical triage that could put users at risk when they seek guidance on whether urgent or emergency care is needed.

Source: News-Medical

The study tested 60 structured clinical scenarios spanning 21 medical specialties, with three independent physicians determining the correct level of urgency using guidelines from 56 medical societies. Each patient scenario was evaluated under 16 different contextual conditions, including variations in race, gender, social dynamics, and barriers to care like lack of insurance or transportation. This rigorous methodology resulted in 960 interactions with ChatGPT Health, compared against physician consensus to assess triage recommendations [1].

AI Health Guidance Failed to Direct Users to Emergency Care in Serious Medical Situations

The research revealed that while ChatGPT Health generally handled textbook emergencies like stroke or severe allergic reactions correctly, it under-triaged emergency cases more than half the time when physicians determined emergency room visits were necessary. Lead author Ashwin Ramaswamy, MD, Instructor of Urology at the Icahn School of Medicine, explained that the tool struggled most in nuanced situations where danger isn't immediately obvious—precisely when clinical judgment matters most [2].

In one particularly concerning example involving an asthma scenario, the system identified early warning signs of respiratory failure in its explanation but still advised waiting rather than seeking emergency treatment. This pattern repeated across multiple serious medical situations: the tool often demonstrated recognition of dangerous findings in its own explanations yet still reassured patients instead of directing them to immediate care [1].

Inconsistent Suicide-Crisis Safeguards Raised Alarm Among Researchers

Perhaps most troubling, the study identified serious flaws in ChatGPT Health's suicide-risk alerts, which were designed to direct users to the 988 Crisis Lifeline in high-risk scenarios. The alerts appeared inconsistently, sometimes triggering in lower-risk situations while failing to appear when users described specific plans for self-harm. Senior author Girish N. Nadkarni, MD, MPH, Chief AI Officer of the Mount Sinai Health System, described the findings as "particularly surprising and concerning," noting that the system's alerts were "inverted relative to clinical risk" [1].

"In real life, when someone talks about exactly how they would harm themselves, that's a sign of more immediate and serious danger, not less," Nadkarni emphasized. This inversion of risk assessment represents a critical failure in the tool's ability to identify and respond appropriately to users in crisis [2].

Independent Evaluation Should Be Routine for AI Medical Tools

Isaac S. Kohane, MD, PhD, Chair of the Department of Biomedical Informatics at Harvard Medical School, who was not involved with the research, stressed the urgency of the findings: "LLMs have become patients' first stop for medical advice—but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm. When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional" [1].

The study authors advise that for worsening or concerning symptoms, including chest pain, shortness of breath, severe allergic reactions, or changes in mental status, people should seek medical care directly rather than relying solely on AI health guidance. With OpenAI reporting that about 40 million people were using the tool daily within weeks of its release, the gap between widespread adoption and independent safety evaluation raises questions about oversight of consumer AI tools in healthcare. The research team's motivation was straightforward, according to Ramaswamy: "We wanted to answer a very basic but critical question: if someone is experiencing a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency room?" The answer, evidently, is not reliably enough [2].
