2 Sources
[1]
ChatGPT Health fails critical emergency and suicide safety tests
Mount Sinai Health System, Feb 24, 2026

ChatGPT Health, a widely used consumer artificial intelligence (AI) tool that provides health guidance directly to the public -- including advice about how urgently to seek medical care -- may fail to direct users appropriately to emergency care in a significant number of serious cases, according to researchers at the Icahn School of Medicine at Mount Sinai. The study, fast-tracked in the February 23, 2026 online issue of Nature Medicine [https://doi.org/10.1038/s41591-026-04297-7], is the first independent safety evaluation of the large language model (LLM)-based tool since its January 2026 launch. It also identified serious concerns with the tool's suicide-crisis safeguards.

"LLMs have become patients' first stop for medical advice -- but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm," says Isaac S. Kohane, MD, PhD, Chair, Department of Biomedical Informatics at Harvard Medical School, who was not involved with the research. "When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional."

Within weeks of its release, ChatGPT Health's maker, OpenAI, reported that about 40 million people were using the tool daily to seek health information and guidance, including advice about whether to seek urgent or emergency care. At the same time, say the investigators, there was little independent evidence about how safe or reliable its advice actually was.

"That gap motivated our study," says lead author Ashwin Ramaswamy, MD, Instructor of Urology at the Icahn School of Medicine at Mount Sinai. "We wanted to answer a very basic but critical question: if someone is experiencing a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency room?"

With respect to suicide-risk alerts, ChatGPT Health was designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations. However, the investigators found that these alerts appeared inconsistently, sometimes triggering in lower-risk scenarios while -- alarmingly -- failing to appear when users described specific plans for self-harm.

"This was a particularly surprising and concerning finding," says senior and co-corresponding study author Girish N. Nadkarni, MD, MPH, Barbara T. Murphy Chair of the Windreich Department of Artificial Intelligence and Human Health, Director of the Hasso Plattner Institute for Digital Health, and Irene and Dr. Arthur M. Fishberg Professor of Medicine at the Icahn School of Medicine at Mount Sinai, and Chief AI Officer of the Mount Sinai Health System. "While we expected some variability, what we observed went beyond inconsistency. The system's alerts were inverted relative to clinical risk, appearing more reliably for lower-risk scenarios than for cases when someone shared how they intended to hurt themselves. In real life, when someone talks about exactly how they would harm themselves, that's a sign of more immediate and serious danger, not less."

As part of the evaluation, the research team created 60 structured clinical scenarios spanning 21 medical specialties. Cases ranged from minor conditions appropriate for home care to true medical emergencies. Three independent physicians determined the correct level of urgency for each case using guidelines from 56 medical societies. Each scenario was tested under 16 different contextual conditions, including variations in race, gender, social dynamics (such as someone minimizing symptoms), and barriers to care like lack of insurance or transportation. In total, the team conducted 960 interactions with ChatGPT Health and compared its recommendations with physician consensus.
In testing the 60 realistic patient scenarios developed by physicians, the researchers found that while the tool generally handled clear-cut emergencies correctly, it under-triaged more than half of cases that physicians determined required emergency care. The investigators were also struck by how the system failed in emergency medical cases. The tool often demonstrated that it recognized dangerous findings in its own explanations, yet still reassured the patient. "ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions," says Dr. Ramaswamy. "But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most. In one asthma scenario, for example, the system identified early warning signs of respiratory failure in its explanation but still advised waiting rather than seeking emergency treatment." The study authors advise that for worsening or concerning symptoms, including chest pain, shortness of breath, severe allergic reactions, or changes in mental status, people should seek medical care directly rather than relying solely on chatbot guidance. In cases involving thoughts of self-harm, individuals should contact the 988 Suicide and Crisis Lifeline or go to an emergency department. Still, the researchers emphasize that the findings do not suggest consumers should abandon AI health tools altogether. "As a medical student training at a time when AI health tools are already in the hands of millions, I see them as technologies we must learn to integrate thoughtfully into care rather than substitutes for clinical judgment," says Alvira Tyagi, a first-year medical student at the Icahn School of Medicine at Mount Sinai and second author of the study. 
"These systems are changing quickly, so part of our training now must consider learning how to understand their outputs critically, identify where they fall short, and use them in ways that protect patients." The study assessed the system at a single point in time. Because AI models are frequently updated, performance may change over time, underscoring the need for independent evaluation, the researchers say. "Starting medical training alongside tools that are evolving in real time makes it clear that today's results are not set in stone," Ms. Tyagi says. "That reality calls for ongoing review to ensure that improvements in technology translate into safer care." The team plans to continue evaluating updated versions of ChatGPT Health and other consumer-facing AI tools, expanding future research into areas such as pediatric care, medication safety, and non-English-language use.

The paper is titled "ChatGPT Health performance in a structured test of triage recommendations." The study's authors, as listed in the journal, are Ashwin Ramaswamy, MD, MPP; Alvira Tyagi, BA; Hannah Hugo, MD; Joy Jiang, PhD; Pushkala Jayaraman, PhD; Mateen Jangda, MSc; Alexis E. Te, MD; Steven A. Kaplan, MD; Joshua Lampert, MD; Robert Freeman, MSN, MS; Nicholas Gavin, MD, MBA; Ashutosh K. Tewari, MBBS, MCh; Ankit Sakhuja, MBBS, MS; Bilal Naved, PhD; Alexander W. Charney, MD, PhD; Mahmud Omar, MD; Michael A. Gorin, MD; Eyal Klang, MD; Girish N. Nadkarni, MD, MPH.

Mount Sinai Health System

Journal reference: Ramaswamy, A., et al. (2026). ChatGPT Health performance in a structured test of triage recommendations. Nature Medicine. DOI: 10.1038/s41591-026-04297-7. https://www.nature.com/articles/s41591-026-04297-7
[2]
Research Identifies Blind Spots in AI Medical Triage | Newswise
Newswise -- New York, NY [February 24, 2026] -- ChatGPT Health, a widely used consumer artificial intelligence (AI) tool that provides health guidance directly to the public -- including advice about how urgently to seek medical care -- may fail to direct users appropriately to emergency care in a significant number of serious cases, according to researchers at the Icahn School of Medicine at Mount Sinai. The study, fast-tracked in the February 23, 2026 online issue of Nature Medicine [https://doi.org/10.1038/s41591-026-04297-7], is the first independent safety evaluation of the large language model (LLM)-based tool since its January 2026 launch. It also identified serious concerns with the tool's suicide-crisis safeguards. "LLMs have become patients' first stop for medical advice -- but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm," says Isaac S. Kohane, MD, PhD, Chair, Department of Biomedical Informatics at Harvard Medical School, who was not involved with the research. "When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional." Within weeks of its release, ChatGPT Health's maker, OpenAI, reported that about 40 million people were using the tool daily to seek health information and guidance, including advice about whether to seek urgent or emergency care. At the same time, say the investigators, there was little independent evidence about how safe or reliable its advice actually was. "That gap motivated our study," says lead author Ashwin Ramaswamy, MD, Instructor of Urology at the Icahn School of Medicine at Mount Sinai. "We wanted to answer a very basic but critical question: if someone is experiencing a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency room?" 
With respect to suicide-risk alerts, ChatGPT Health was designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations. However, the investigators found that these alerts appeared inconsistently, sometimes triggering in lower-risk scenarios while -- alarmingly -- failing to appear when users described specific plans for self-harm. "This was a particularly surprising and concerning finding," says senior and co-corresponding study author Girish N. Nadkarni, MD, MPH, Barbara T. Murphy Chair of the Windreich Department of Artificial Intelligence and Human Health, Director of the Hasso Plattner Institute for Digital Health, and Irene and Dr. Arthur M. Fishberg Professor of Medicine at the Icahn School of Medicine at Mount Sinai, and Chief AI Officer of the Mount Sinai Health System. "While we expected some variability, what we observed went beyond inconsistency. The system's alerts were inverted relative to clinical risk, appearing more reliably for lower-risk scenarios than for cases when someone shared how they intended to hurt themselves. In real life, when someone talks about exactly how they would harm themselves, that's a sign of more immediate and serious danger, not less." As part of the evaluation, the research team created 60 structured clinical scenarios spanning 21 medical specialties. Cases ranged from minor conditions appropriate for home care to true medical emergencies. Three independent physicians determined the correct level of urgency for each case using guidelines from 56 medical societies. Each scenario was tested under 16 different contextual conditions, including variations in race, gender, social dynamics (such as someone minimizing symptoms), and barriers to care like lack of insurance or transportation. In total, the team conducted 960 interactions with ChatGPT Health and compared its recommendations with physician consensus. 
In testing the 60 realistic patient scenarios developed by physicians, the researchers found that while the tool generally handled clear-cut emergencies correctly, it under-triaged more than half of cases that physicians determined required emergency care. The investigators were also struck by how the system failed in emergency medical cases. The tool often demonstrated that it recognized dangerous findings in its own explanations, yet still reassured the patient. "ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions," says Dr. Ramaswamy. "But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most. In one asthma scenario, for example, the system identified early warning signs of respiratory failure in its explanation but still advised waiting rather than seeking emergency treatment." The study authors advise that for worsening or concerning symptoms, including chest pain, shortness of breath, severe allergic reactions, or changes in mental status, people should seek medical care directly rather than relying solely on chatbot guidance. In cases involving thoughts of self-harm, individuals should contact the 988 Suicide and Crisis Lifeline or go to an emergency department. Still, the researchers emphasize that the findings do not suggest consumers should abandon AI health tools altogether. "As a medical student training at a time when AI health tools are already in the hands of millions, I see them as technologies we must learn to integrate thoughtfully into care rather than substitutes for clinical judgment," says Alvira Tyagi, a first-year medical student at the Icahn School of Medicine at Mount Sinai and second author of the study. 
"These systems are changing quickly, so part of our training now must consider learning how to understand their outputs critically, identify where they fall short, and use them in ways that protect patients." The study assessed the system at a single point in time. Because AI models are frequently updated, performance may change over time, underscoring the need for independent evaluation, the researchers say. "Starting medical training alongside tools that are evolving in real time makes it clear that today's results are not set in stone," Ms. Tyagi says. "That reality calls for ongoing review to ensure that improvements in technology translate into safer care." The team plans to continue evaluating updated versions of ChatGPT Health and other consumer-facing AI tools, expanding future research into areas such as pediatric care, medication safety, and non-English-language use.

The paper is titled "ChatGPT Health performance in a structured test of triage recommendations." The study's authors, as listed in the journal, are Ashwin Ramaswamy, MD, MPP; Alvira Tyagi, BA; Hannah Hugo, MD; Joy Jiang, PhD; Pushkala Jayaraman, PhD; Mateen Jangda, MSc; Alexis E. Te, MD; Steven A. Kaplan, MD; Joshua Lampert, MD; Robert Freeman, MSN, MS; Nicholas Gavin, MD, MBA; Ashutosh K. Tewari, MBBS, MCh; Ankit Sakhuja, MBBS, MS; Bilal Naved, PhD; Alexander W. Charney, MD, PhD; Mahmud Omar, MD; Michael A. Gorin, MD; Eyal Klang, MD; Girish N. Nadkarni, MD, MPH.

For more Mount Sinai artificial intelligence news, visit: https://icahn.mssm.edu/about/artificial-intelligence.

About Mount Sinai's Windreich Department of AI and Human Health

Led by Girish N. Nadkarni, MD, MPH -- an international authority on the safe, effective, and ethical use of AI in health care -- Mount Sinai's Windreich Department of AI and Human Health is the first of its kind at a U.S. medical school, pioneering transformative advancements at the intersection of artificial intelligence and human health.
The Department is committed to leveraging AI in a responsible, effective, ethical, and safe manner to transform research, clinical care, education, and operations. By bringing together world-class AI expertise, cutting-edge infrastructure, and unparalleled computational power, the department is advancing breakthroughs in multi-scale, multimodal data integration while streamlining pathways for rapid testing and translation into practice. The Department benefits from dynamic collaborations across Mount Sinai, including with the Hasso Plattner Institute for Digital Health at Mount Sinai -- a partnership between the Hasso Plattner Institute for Digital Engineering in Potsdam, Germany, and the Mount Sinai Health System -- which complements its mission by advancing data-driven approaches to improve patient care and health outcomes. At the heart of this innovation is the renowned Icahn School of Medicine at Mount Sinai, which serves as a central hub for learning and collaboration. This unique integration enables dynamic partnerships across institutes, academic departments, hospitals, and outpatient centers, driving progress in disease prevention, improving treatments for complex illnesses, and elevating quality of life on a global scale. In 2024, the Department's innovative NutriScan AI application, developed by the Mount Sinai Health System Clinical Data Science team in partnership with Department faculty, earned Mount Sinai Health System the prestigious Hearst Health Prize. NutriScan is designed to facilitate faster identification and treatment of malnutrition in hospitalized patients. This machine learning tool improves malnutrition diagnosis rates and resource utilization, demonstrating the impactful application of AI in health care. 
For more information on Mount Sinai's Windreich Department of AI and Human Health, visit: ai.mssm.edu

About the Hasso Plattner Institute at Mount Sinai

At the Hasso Plattner Institute for Digital Health at Mount Sinai, the tools of data science, biomedical and digital engineering, and medical expertise are used to improve and extend lives. The Institute represents a collaboration between the Hasso Plattner Institute for Digital Engineering in Potsdam, Germany, and the Mount Sinai Health System. Girish Nadkarni, MD, MPH, who directs the Institute, and Professor Lothar Wieler, a globally recognized expert in public health and digital transformation, jointly oversee the partnership, driving innovations that positively impact patient lives while transforming how people think about personal health and health systems. The Hasso Plattner Institute for Digital Health at Mount Sinai receives generous support from the Hasso Plattner Foundation. Current research programs and machine learning efforts focus on improving the ability to diagnose and treat patients.

About the Icahn School of Medicine at Mount Sinai

The Icahn School of Medicine at Mount Sinai is internationally renowned for its outstanding research, educational, and clinical care programs. It is the sole academic partner for the seven member hospitals* of the Mount Sinai Health System, one of the largest academic health systems in the United States, providing care to New York City's large and diverse patient population. The Icahn School of Medicine at Mount Sinai offers highly competitive MD, PhD, MD-PhD, and master's degree programs, with enrollment of more than 1,200 students. It has the largest graduate medical education program in the country, with more than 2,600 clinical residents and fellows training throughout the Health System.
Its Graduate School of Biomedical Sciences offers 13 degree-granting programs, conducts innovative basic and translational research, and trains more than 560 postdoctoral research fellows. Ranked 11th nationwide in National Institutes of Health (NIH) funding, the Icahn School of Medicine at Mount Sinai ranks in the 99th percentile in research dollars per investigator, according to the Association of American Medical Colleges. More than 4,500 scientists, educators, and clinicians work within and across dozens of academic departments and multidisciplinary institutes with an emphasis on translational research and therapeutics. Through Mount Sinai Innovation Partners (MSIP), the Health System facilitates the real-world application and commercialization of medical breakthroughs made at Mount Sinai.
A Mount Sinai study published in Nature Medicine found that ChatGPT Health, used by 40 million people daily, failed to direct users appropriately to emergency care in more than half of serious cases. The safety evaluation also revealed alarming inconsistencies in suicide-crisis safeguards, with alerts appearing less reliably when users described specific self-harm plans.
ChatGPT Health, the consumer AI tool providing health guidance to approximately 40 million daily users, fails critical emergency safety tests and shows dangerous inconsistencies in suicide-crisis safeguards, according to a study published February 23, 2026, in Nature Medicine [1][2]. Researchers at the Icahn School of Medicine at Mount Sinai conducted the first independent safety evaluation of the large language model (LLM)-based tool since OpenAI launched it in January 2026. The findings reveal significant blind spots in AI medical triage that could put users at risk when seeking guidance on whether to seek urgent or emergency care.
Source: News-Medical
The study tested 60 structured clinical scenarios spanning 21 medical specialties, with three independent physicians determining the correct level of urgency using guidelines from 56 medical societies. Each patient scenario was evaluated under 16 different contextual conditions, including variations in race, gender, social dynamics, and barriers to care like lack of insurance or transportation. This rigorous methodology resulted in 960 interactions with ChatGPT Health, compared against physician consensus to assess triage recommendations [1].

The research revealed that while ChatGPT Health generally handled textbook emergencies like stroke or severe allergic reactions correctly, it under-triaged emergency cases more than half the time when physicians determined emergency room visits were necessary. Lead author Ashwin Ramaswamy, MD, Instructor of Urology at the Icahn School of Medicine, explained that the tool struggled most in nuanced situations where danger isn't immediately obvious, precisely when clinical judgment matters most [2].

In one particularly concerning example involving an asthma scenario, the system identified early warning signs of respiratory failure in its explanation but still advised waiting rather than seeking emergency treatment. This pattern repeated across multiple serious medical situations: the tool often demonstrated recognition of dangerous findings in its own explanations yet still reassured patients instead of directing them to immediate care [1].

Perhaps most troubling, the study identified serious flaws in ChatGPT Health's suicide-risk alerts, which were designed to direct users to the 988 Crisis Lifeline in high-risk scenarios. The alerts appeared inconsistently, sometimes triggering in lower-risk situations while failing to appear when users described specific plans for self-harm. Senior author Girish N. Nadkarni, MD, MPH, Chief AI Officer of the Mount Sinai Health System, described the findings as "particularly surprising and concerning," noting that the system's alerts were "inverted relative to clinical risk" [1].

"In real life, when someone talks about exactly how they would harm themselves, that's a sign of more immediate and serious danger, not less," Nadkarni emphasized. This inversion of risk assessment represents a critical failure in the tool's ability to identify and respond appropriately to users in crisis [2].

Isaac S. Kohane, MD, PhD, Chair of the Department of Biomedical Informatics at Harvard Medical School, who was not involved with the research, stressed the urgency of the findings: "LLMs have become patients' first stop for medical advice -- but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm. When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional" [1].

The study authors advise that for worsening or concerning symptoms, including chest pain, shortness of breath, severe allergic reactions, or changes in mental status, people should seek medical care directly rather than relying solely on AI health guidance. With OpenAI reporting that about 40 million people were using the tool daily within weeks of its release, the gap between widespread adoption and independent safety evaluation raises questions about oversight of consumer AI tools in healthcare. The research team's motivation was straightforward, according to Ramaswamy: "We wanted to answer a very basic but critical question: if someone is experiencing a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency room?" The answer, evidently, is not reliably enough [2].