ChatGPT Health fails to recognize over half of medical emergencies in first independent safety test

Reviewed by Nidhi Govil

OpenAI's ChatGPT Health missed over half of medical emergencies in a Nature Medicine study, directing patients to routine appointments instead of emergency rooms. With 40 million daily users seeking health guidance, the AI tool also showed alarming inconsistencies in suicide-crisis safeguards, triggering alerts for low-risk cases while failing to respond when users described specific self-harm plans.

ChatGPT Health Under-Triaged 51.6% of Emergency Cases

ChatGPT Health, OpenAI's dedicated consumer AI tool for health guidance, failed to appropriately direct users to emergency care in more than half of serious medical emergencies, according to the first independent safety evaluation, published in Nature Medicine [1]. The study, conducted by researchers at Mount Sinai Health System, tested 60 realistic patient scenarios spanning 21 medical specialties across 960 interactions and found that the tool under-triaged 51.6% of cases that physicians determined required immediate emergency care [3]. Instead of recommending emergency room visits, it directed patients experiencing life-threatening conditions such as diabetic ketoacidosis and respiratory failure to schedule routine appointments within 24 to 48 hours [4].

The stakes are extraordinarily high given that approximately 40 million people use the tool daily to seek health information and decide whether their symptoms warrant urgent care [1]. Lead author Ashwin Ramaswamy, an Instructor of Urology at the Icahn School of Medicine, explained that the study aimed to answer a basic but critical question: "If someone is experiencing a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency room?" [5]

AI Safety Concerns in Nuanced Medical Situations

While ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions, correctly triaging these 100% of the time, it struggled significantly in more nuanced situations where the danger is not immediately obvious [4]. In one asthma scenario, the system identified early warning signs of respiratory failure in its own explanation but still advised waiting rather than seeking emergency treatment [1]. This paradoxical behavior reveals a critical blind spot in the model's clinical judgment: the large language model (LLM) recognizes dangerous findings yet still reassures the patient.

Doctoral researcher Alex Ruani, who studies health misinformation mitigation at University College London, described the findings as "unbelievably dangerous," noting that in one simulation the platform sent a suffocating woman to a future appointment she would not live to see in more than eight out of ten attempts, an 84% failure rate [3]. Meanwhile, the chatbot also over-triaged 64.8% of nonurgent cases, recommending immediate medical care for patients whose conditions posed no danger [2].

Inverted Suicide-Crisis Safeguards Raise Alarm

The independent safety evaluation revealed particularly concerning failures in the suicide-crisis safeguards designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations [1]. Researchers found that these alerts appeared inconsistently and were "inverted relative to clinical risk," appearing more reliably for lower-risk scenarios while failing to appear when users described specific plans for self-harm [5].

In testing suicidal-ideation scenarios, Ramaswamy described a case in which a 27-year-old patient mentioned thinking about taking a lot of pills. When the patient described symptoms alone, the crisis-intervention banner linking to suicide-help services appeared every time. However, when normal lab results were added to the same scenario, with identical wording and severity, the banner vanished, appearing in zero of 16 attempts [3]. Senior study author Girish N. Nadkarni, Chief AI Officer of the Mount Sinai Health System, noted that "when someone talks about exactly how they would harm themselves, that's a sign of more immediate and serious danger, not less" [1].

Rigorous Testing Methodology Across Patient Scenarios

The research team created 60 structured clinical scenarios covering conditions from mild illnesses to true medical emergencies. Three independent physicians determined the correct level of urgency for each case using guidelines from 56 medical societies [5]. Each scenario was tested under 16 different contextual conditions, including variations in race and gender, social dynamics such as someone minimizing symptoms, and barriers to care like lack of insurance or transportation [1]. The variations were designed so that the correct triage recommendation stayed the same regardless of demographic changes, and the study found no significant differences based on these factors [4].
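This design reduces to stratified counting: compare the model's triage call against the physician consensus for each of the 960 interactions, then compute error rates within each ground-truth stratum. The Python sketch below illustrates that arithmetic under an assumed four-level triage scale and made-up example data; it is not the study's actual evaluation code.

```python
# Minimal illustrative harness, NOT the authors' code. Triage levels,
# labels, and example data below are hypothetical.
from collections import Counter

# Ordered triage levels, least to most urgent (assumed scale).
LEVELS = ["self_care", "routine_appointment", "urgent_care", "emergency"]
RANK = {level: i for i, level in enumerate(LEVELS)}

def score(physician_label: str, model_label: str) -> str:
    """Compare the model's triage call to the physician consensus."""
    diff = RANK[model_label] - RANK[physician_label]
    if diff < 0:
        return "under_triage"  # model recommended less urgent care than needed
    if diff > 0:
        return "over_triage"   # model escalated beyond the consensus level
    return "correct"

# Hypothetical (physician consensus, model output) pairs; the study ran
# 60 scenarios x 16 contextual variants = 960 such interactions.
interactions = [
    ("emergency", "routine_appointment"),  # e.g. DKA sent to a 24-48h visit
    ("emergency", "emergency"),            # e.g. stroke correctly escalated
    ("self_care", "emergency"),            # nonurgent case over-escalated
    ("routine_appointment", "routine_appointment"),
]

tally = Counter(score(truth, pred) for truth, pred in interactions)

# Rates are reported within each ground-truth stratum, as in the paper:
# under-triage among true emergencies, over-triage among nonurgent cases.
emergencies = [(t, p) for t, p in interactions if t == "emergency"]
under_rate = sum(score(t, p) == "under_triage" for t, p in emergencies) / len(emergencies)
print(tally, f"under-triage among emergencies: {under_rate:.0%}")
```

Applied to the real results, this kind of stratified counting yields the headline figures: 51.6% under-triage among physician-labeled emergencies and 64.8% over-triage among nonurgent cases.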

Notably, the platform was nearly 12 times more likely to downplay symptoms when the "patient" mentioned that a friend had suggested it was nothing serious [3]. This susceptibility to social influence demonstrates how the tool can fail to recognize medical emergencies once contextual noise is introduced.

Expert Calls for Mandatory Independent Oversight

Isaac S. Kohane, Chair of the Department of Biomedical Informatics at Harvard Medical School, who was not involved with the research, emphasized that "LLMs have become patients' first stop for medical advice—but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm. When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional" [1].

Prof. Paul Henman, a digital sociologist at the University of Queensland, warned that if ChatGPT Health were used by people at home, "it could lead to higher numbers of unnecessary medical presentations for low-level conditions, and a failure of people to obtain urgent medical care when required, which could feasibly lead to unnecessary harm and death" [3]. He also raised concerns about legal liability, noting that a number of lawsuits against tech companies are already in motion over suicide and self-harm following AI chatbot use.

OpenAI Response and Future Implications

An OpenAI spokesperson told media outlets that while the company welcomed independent research evaluating AI systems in healthcare, the study did not reflect how people typically use ChatGPT Health in real life [3]. The spokesperson emphasized that the model is continuously updated and refined, and that the chatbot is designed for users to ask follow-up questions and add context, rather than to deliver a single verdict on a medical scenario [4]. ChatGPT Health is currently available only to a limited number of users on a waitlist, and OpenAI is working to improve safety and reliability before wider release.

However, researchers and experts maintain that a plausible risk of harm is sufficient to justify stronger safeguards and independent oversight [3]. Dr. John Mafi, an associate professor of medicine at UCLA Health, stressed that "before you roll something like this out, to make life-affecting decisions, you need to rigorously test it in a controlled trial, where you're making sure that the benefits outweigh the harms" [4]. The study authors advise that for worsening or concerning symptoms, including chest pain, shortness of breath, severe allergic reactions, or changes in mental status, people should seek medical care directly rather than relying solely on AI tools.
