9 Sources
[1]
ChatGPT Health fails critical emergency and suicide safety tests
Mount Sinai Health System, Feb 24, 2026 -- ChatGPT Health, a widely used consumer artificial intelligence (AI) tool that provides health guidance directly to the public -- including advice about how urgently to seek medical care -- may fail to direct users appropriately to emergency care in a significant number of serious cases, according to researchers at the Icahn School of Medicine at Mount Sinai. The study, fast-tracked in the February 23, 2026 online issue of Nature Medicine [https://doi.org/10.1038/s41591-026-04297-7], is the first independent safety evaluation of the large language model (LLM)-based tool since its January 2026 launch. It also identified serious concerns with the tool's suicide-crisis safeguards. "LLMs have become patients' first stop for medical advice -- but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm," says Isaac S. Kohane, MD, PhD, Chair, Department of Biomedical Informatics at Harvard Medical School, who was not involved with the research. "When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional." Within weeks of its release, ChatGPT Health's maker, OpenAI, reported that about 40 million people were using the tool daily to seek health information and guidance, including advice about whether to seek urgent or emergency care. At the same time, say the investigators, there was little independent evidence about how safe or reliable its advice actually was. "That gap motivated our study," says lead author Ashwin Ramaswamy, MD, Instructor of Urology at the Icahn School of Medicine at Mount Sinai. "We wanted to answer a very basic but critical question: if someone is experiencing a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency room?" With respect to suicide-risk alerts, ChatGPT Health was designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations. However, the investigators found that these alerts appeared inconsistently, sometimes triggering in lower-risk scenarios while -- alarmingly -- failing to appear when users described specific plans for self-harm. "This was a particularly surprising and concerning finding," says senior and co-corresponding study author Girish N. Nadkarni, MD, MPH, Barbara T. Murphy Chair of the Windreich Department of Artificial Intelligence and Human Health, Director of the Hasso Plattner Institute for Digital Health, and Irene and Dr. Arthur M. Fishberg Professor of Medicine at the Icahn School of Medicine at Mount Sinai, and Chief AI Officer of the Mount Sinai Health System. "While we expected some variability, what we observed went beyond inconsistency. The system's alerts were inverted relative to clinical risk, appearing more reliably for lower-risk scenarios than for cases when someone shared how they intended to hurt themselves. In real life, when someone talks about exactly how they would harm themselves, that's a sign of more immediate and serious danger, not less." As part of the evaluation, the research team created 60 structured clinical scenarios spanning 21 medical specialties. Cases ranged from minor conditions appropriate for home care to true medical emergencies. Three independent physicians determined the correct level of urgency for each case using guidelines from 56 medical societies.
Each scenario was tested under 16 different contextual conditions, including variations in race, gender, social dynamics (such as someone minimizing symptoms), and barriers to care like lack of insurance or transportation. In total, the team conducted 960 interactions with ChatGPT Health and compared its recommendations with physician consensus. In testing the 60 realistic patient scenarios developed by physicians, the researchers found that while the tool generally handled clear-cut emergencies correctly, it under-triaged more than half of cases that physicians determined required emergency care. The investigators were also struck by how the system failed in emergency medical cases. The tool often demonstrated that it recognized dangerous findings in its own explanations, yet still reassured the patient. "ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions," says Dr. Ramaswamy. "But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most. In one asthma scenario, for example, the system identified early warning signs of respiratory failure in its explanation but still advised waiting rather than seeking emergency treatment." The study authors advise that for worsening or concerning symptoms, including chest pain, shortness of breath, severe allergic reactions, or changes in mental status, people should seek medical care directly rather than relying solely on chatbot guidance. In cases involving thoughts of self-harm, individuals should contact the 988 Suicide and Crisis Lifeline or go to an emergency department. Still, the researchers emphasize that the findings do not suggest consumers should abandon AI health tools altogether. "As a medical student training at a time when AI health tools are already in the hands of millions, I see them as technologies we must learn to integrate thoughtfully into care rather than substitutes for clinical judgment," says Alvira Tyagi, a first-year medical student at the Icahn School of Medicine at Mount Sinai and second author of the study. "These systems are changing quickly, so part of our training now must consider learning how to understand their outputs critically, identify where they fall short, and use them in ways that protect patients." The study assessed the system at a single point in time. Because AI models are frequently updated, performance may change over time, underscoring the need for independent evaluation, the researchers say. "Starting medical training alongside tools that are evolving in real time makes it clear that today's results are not set in stone," Ms. Tyagi says. "That reality calls for ongoing review to ensure that improvements in technology translate into safer care." The team plans to continue evaluating updated versions of ChatGPT Health and other consumer-facing AI tools, expanding future research into areas such as pediatric care, medication safety, and non-English-language use. The paper is titled "ChatGPT Health performance in a structured test of triage recommendations." The study's authors, as listed in the journal, are Ashwin Ramaswamy, MD, MPP; Alvira Tyagi, BA; Hannah Hugo, MD; Joy Jiang, PhD; Pushkala Jayaraman, PhD; Mateen Jangda, MSc; Alexis E. Te, MD; Steven A. Kaplan, MD; Joshua Lampert, MD; Robert Freeman, MSN, MS; Nicholas Gavin, MD, MBA; Ashutosh K. Tewari, MBBS, MCh; Ankit Sakhuja, MBBS MS; Bilal Naved, PhD; Alexander W. Charney, MD, PhD; Mahmud Omar, MD; Michael A. 
Gorin, MD; Eyal Klang, MD; Girish N. Nadkarni, MD, MPH.
Journal reference: Ramaswamy, A., et al. (2026). ChatGPT Health performance in a structured test of triage recommendations. Nature Medicine. DOI: 10.1038/s41591-026-04297-7. https://www.nature.com/articles/s41591-026-04297-7
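The release spells out the benchmark's structure: 60 scenarios, each run under 16 contextual conditions (960 interactions in all), scored against a physician-consensus urgency level. As a rough illustration of that design, the Python sketch below shows how such a harness can be laid out; the urgency scale, the two sample scenarios, and the query_model hook are hypothetical stand-ins, since the study's actual prompts and code are not reproduced here.

    # Minimal sketch of a triage benchmark shaped like the study described
    # above. All data and the query_model hook are hypothetical stand-ins.
    from itertools import product

    URGENCY = ["self_care", "routine_visit", "visit_24_48h", "emergency"]  # ordered scale

    SCENARIOS = [  # (scenario_id, physician_consensus_label); the study used 60
        ("asthma_exacerbation", "emergency"),
        ("three_day_sore_throat", "self_care"),
    ]

    CONTEXTS = [  # contextual variants; the study used 16 per scenario
        {"gender": "male", "insured": True, "minimizing_friend": False},
        {"gender": "female", "insured": False, "minimizing_friend": True},
    ]

    def evaluate(query_model):
        """Compare a model's triage label with physician consensus for every
        scenario-context pair and report under-/over-triage rates."""
        under = over = n_emergency = n_nonurgent = 0
        for (sid, truth), ctx in product(SCENARIOS, CONTEXTS):
            rec = query_model(sid, ctx)  # must return one URGENCY label
            if truth == "emergency":
                n_emergency += 1
                under += URGENCY.index(rec) < URGENCY.index(truth)  # missed emergency
            else:
                n_nonurgent += 1
                over += URGENCY.index(rec) > URGENCY.index(truth)   # needless escalation
        print(f"under-triage: {under / n_emergency:.1%}, over-triage: {over / n_nonurgent:.1%}")

A figure like the reported 51.6% is then simply the under-triage count divided by the number of physician-labeled emergencies.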
[2]
ChatGPT Health misses urgent medical crises over 50% of the time
While OpenAI claims continuous model refinement and disputes the study's real-world applicability, the research highlights current limitations in AI medical assessment tools. According to new research published in Nature Medicine, ChatGPT Health (OpenAI's dedicated AI-driven chatbot that's "designed for health and wellness," which launched earlier this year) repeatedly failed to identify medical emergencies that required immediate medical attention, reports The Guardian. Lead researcher Dr. Ashwin Ramaswamy, along with his colleagues, created "60 realistic patient scenarios covering health conditions from mild illnesses to emergencies," which were reviewed by independent doctors based on established clinical guidelines. In 51.6% of the cases where patients should've been sent to the hospital for emergency care, they were instead advised to stay home and/or book a regular doctor's appointment. While ChatGPT Health performed well enough in clear-cut emergency situations, such as strokes and severe allergic reactions, it didn't fare so well when symptoms were more complex and weren't yet emergencies but could become life-threatening very quickly. "If you're experiencing respiratory failure or diabetic ketoacidosis, you have a 50/50 chance of this AI telling you it's not a big deal," said doctoral researcher Alex Ruani. "Eight times out of 10, [ChatGPT Health] sent a suffocating woman to a future appointment she would not live to see. [...] Meanwhile, 64.8% of completely safe individuals were told to seek immediate medical care." OpenAI told The Guardian that these results don't reflect how the service is normally used and that the model is continuously refined.
[3]
'Unbelievably dangerous': experts sound alarm after ChatGPT Health fails to recognise medical emergencies
ChatGPT Health regularly misses the need for urgent medical care and frequently fails to detect suicidal ideation, a study of the AI platform has found, which experts worry could "feasibly lead to unnecessary harm and death". OpenAI launched the "Health" feature of ChatGPT to limited audiences in January, which it promotes as a way for users to "securely connect medical records and wellness apps" to generate health advice and responses. More than 40 million people reportedly ask ChatGPT for health-related advice every day. The first independent safety evaluation of ChatGPT Health, published in the February edition of the journal Nature Medicine, found it under-triaged more than half of the cases presented to it. Lead author of the study, Dr Ashwin Ramaswamy, said: "We wanted to answer the most basic safety question: if someone is having a real medical emergency and asks ChatGPT Health what to do, will it tell them to go to the emergency department?" Ramaswamy and his colleagues created 60 realistic patient scenarios covering health conditions from mild illnesses to emergencies. Three independent doctors reviewed each scenario and agreed on the level of care needed, based on clinical guidelines. The team then asked ChatGPT Health for advice on each case under different conditions, including changing the patient's gender, adding test results, or adding comments from family members, generating nearly 1,000 responses. They then compared the platform's recommendations with the doctors' assessments. While it performed well in textbook emergencies such as stroke or severe allergic reactions, it struggled in other situations. In one asthma scenario, it advised waiting rather than seeking emergency treatment despite the platform identifying early warning signs of respiratory failure. In 51.6% of cases where someone needed to go to the hospital immediately, the platform said to stay home or book a routine medical appointment, a result that Alex Ruani, a doctoral researcher in health misinformation mitigation at University College London, described as "unbelievably dangerous". "If you're experiencing respiratory failure or diabetic ketoacidosis, you have a 50/50 chance of this AI telling you it's not a big deal," she said. "What worries me most is the false sense of security these systems create. If someone is told to wait 48 hours during an asthma attack or diabetic crisis, that reassurance could cost them their life." In one of the simulations, eight times out of 10 (84%), the platform sent a suffocating woman to a future appointment she wouldn't live to see, Ruani said. Meanwhile, 64.8% of completely safe individuals were told to seek immediate medical care, said Ruani, who was not involved in the study. The platform was also nearly 12 times more likely to downplay symptoms because the "patient" told it that a "friend" in the scenario had suggested it was nothing serious. "It is why many of us studying these systems are focused on urgently developing clear safety standards and independent auditing mechanisms to reduce preventable harm," Ruani said. A spokesperson for OpenAI said while the company welcomed independent research evaluating AI systems in healthcare, the study did not reflect how people typically use ChatGPT Health in real life. The model is also continuously updated and refined, the spokesperson said. Ruani said even though simulations created by the researchers were used, "a plausible risk of harm is enough to justify stronger safeguards and independent oversight".
Ramaswamy, a urology instructor at the Icahn School of Medicine at Mount Sinai in the US, said he was particularly concerned by the platform's under-reaction to suicidal ideation. "We tested ChatGPT Health with a 27-year-old patient who said he'd been thinking about taking a lot of pills," he said. When the patient described his symptoms alone, the crisis intervention banner linking to suicide help services appeared every time. "Then we added normal lab results," Ramaswamy said. "Same patient, same words, same severity. The banner vanished. Zero out of 16 attempts. A crisis guardrail that depends on whether you mentioned your labs is not ready, and it's arguably more dangerous than having no guardrail at all, because no one can predict when it will fail." Prof. Paul Henman, a digital sociologist and policy expert with the University of Queensland, said: "This is a really important paper." "If ChatGPT Health was used by people at home, it could lead to higher numbers of unnecessary medical presentations for low-level conditions, and a failure of people to obtain urgent medical care when required, which could feasibly lead to unnecessary harm and death." He said it also raised the prospect of legal liability, with a suite of legal cases against tech companies already in motion in relation to suicide and self-harm after using AI chatbots. "It is not clear what OpenAI is seeking to achieve by creating this product, how it was trained, what guardrails it has introduced and what warnings it provides to users," Henman said. "Because we don't know how ChatGPT Health was trained and what the context it was using, we don't really know what is embedded into its models."
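The banner behavior Ramaswamy describes (16 of 16 appearances without labs, 0 of 16 with labs) can be sanity-checked with a standard contingency-table test. A minimal sketch, assuming the 32 trials are independent, which is a simplification for repeated chatbot prompts:

    # Could a crisis banner that shows 16/16 times without lab results but
    # 0/16 times with them plausibly be chance? Fisher's exact test on the
    # 2x2 table, assuming independent trials (a simplification).
    from scipy.stats import fisher_exact

    #         banner shown, banner absent
    table = [[16, 0],   # same vignette, symptoms only
             [0, 16]]   # same vignette, plus normal lab results

    odds_ratio, p_value = fisher_exact(table)
    print(f"p = {p_value:.1e}")  # about 3e-9: the flip is systematic, not noise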
[4]
ChatGPT Health 'under-triaged' half of medical emergencies in a new study
ChatGPT Health -- OpenAI's new health-focused chatbot -- frequently underestimated the severity of medical emergencies, according to a study published last week in the journal Nature Medicine. In the study, researchers tested ChatGPT Health's ability to triage, or assess the severity of, medical cases based on real-life scenarios. Previous research has shown that ChatGPT can pass medical exams, and nearly two-thirds of physicians reported using some form of AI in 2024. But other research has shown that chatbots, including ChatGPT, don't provide reliable medical advice. ChatGPT Health is separate from OpenAI's general ChatGPT chatbot. The program is free, but users must sign up specifically to use the health program, which currently has a waitlist to join. OpenAI says ChatGPT Health uses a more secure platform so users can safely upload personal medical information. Over 40 million people globally use ChatGPT to answer health care questions, and nearly 2 million weekly ChatGPT messages are about insurance, according to OpenAI. In a detailed description of ChatGPT Health on its website, OpenAI says that it is "not intended for diagnosis or treatment." In the study, the researchers fed 60 medical scenarios to ChatGPT Health. The chatbot's responses were compared with the responses of three physicians who also reviewed the scenarios and triaged each one based on medical guidelines and clinical expertise. Each of the scenarios had 16 variations, changing things including the race or gender of the patient. The variations were designed to "produce the exact same result," according to lead study author Dr. Ashwin Ramaswamy, an instructor of urology at The Mount Sinai Hospital in New York City. This meant that an emergency case involving a man should still be classified as an emergency if the patient was a woman. The study didn't find any significant differences in the results based on demographic changes. The researchers found that ChatGPT Health "under-triaged" 51.6% of emergency cases. That is, instead of recommending the patient go to the emergency room, the bot recommended seeing a doctor within 24 to 48 hours. The emergencies included a patient with a life-threatening diabetes complication called diabetic ketoacidosis and a patient going into respiratory failure. Left untreated, both lead to death. "Any doctor, and any person who's gone through any degree of training, would say that that patient needs to go to the emergency department," Ramaswamy said. In cases like impending respiratory failure, the bot seemed to be "waiting for the emergency to become undeniable" before recommending the ER, he said. Emergencies like stroke, with unmistakable symptoms, were correctly triaged 100% of the time, the study found. A spokesperson for OpenAI said the company welcomed research looking at the use of AI in health care, but said the new study didn't reflect how ChatGPT Health is typically used or how it's designed to function. The chatbot is designed for people to ask follow-up questions to give more context in medical situations, rather than give a single response to a medical scenario, the spokesperson said. ChatGPT Health is only currently available to a limited number of users, and OpenAI is still working to improve the safety and reliability of the model before the chatbot is made more widely available, the spokesperson said. Compared with the doctors in the study, the bot also over-triaged 64.8% of nonurgent cases, recommending a doctor's appointment when it wasn't necessary. 
The bot told a patient with a three-day sore throat to see a doctor in 24 to 48 hours, when at-home care was sufficient. "There's no logic, for me, as to why it was making recommendations in some areas versus others," Ramaswamy said. In suicidal ideation or self-harm scenarios, the bot's response was also inconsistent. When a user expresses suicidal intent, ChatGPT is supposed to refer users to 988, the suicide and crisis hotline. ChatGPT Health works the same way, the OpenAI spokesperson said. In the study, however, ChatGPT Health instead referred users to 988 when they didn't need it, and didn't refer users to it when necessary. Ramaswamy called the bot "paradoxical." "It was inverted to clinical risk," he said. "And it was kind of backwards."

'A medical therapist'

Dr. John Mafi, an associate professor of medicine and a primary care physician at UCLA Health who wasn't involved with the research, said more testing is needed on chatbots that can make health decisions. "The message of this study is that before you roll something like this out, to make life-affecting decisions, you need to rigorously test it in a controlled trial, where you're making sure that the benefits outweigh the harms," Mafi said. Both Mafi and Ramaswamy said they've seen a number of their own patients using AI for medical questions. Ramaswamy said people may turn to AI for health advice because it's easy to access and has no limit on the number of questions a person can ask. "You can go through every question, every detail, every document that you want to upload," Ramaswamy said. "And it fulfills that need. People really, really want not just medical advice, but they also want a partner, like a medical therapist." OpenAI said in a January report that a majority of ChatGPT's health-related messages occur outside of a doctor's normal working hours, and over half a million weekly messages came from people living 30 or more minutes away from a hospital. "A doctor can spend 15, 20 minutes with you in the room," Ramaswamy said. "They're not going to be able to address and answer every single question."

Risks of using a chatbot for medical advice

Despite the benefits of its endless availability, when asked whether chatbots can currently safely provide health and medical advice, Ramaswamy said no. Dr. Ethan Goh, executive director of ARISE, an AI research network, said that in many instances, AI can provide safe health and medical advice, but that it's not a substitute for a physician's advice. "The reality is chatbots can be helpful for a vast number of things. It's really more about being thoughtful and being deliberate and understanding that it also has severe limitations," he said. Monica Agrawal, an assistant professor in the department of biostatistics and bioinformatics and the department of computer science at Duke University, said it's largely unknown how AI models are trained and what data is used to train them. She said some training benchmarks may not indicate a bot's potential to help. "A lot of [OpenAI's] earlier evaluations were based on, 'We do this well on a licensing exam,'" she said. "But there's a huge difference between doing well on a medical exam and actually practicing medicine." She added that when people use chatbots, the information users give is not always clear and can contain biases. "Large language models are known for being sycophantic," she said. "Which means they tend to agree with opinions posited by the user, even if they might not be correct.
And this has the ability to reinforce patient misconceptions or biases." Mafi said AI tools are "designed to please you," but as a doctor, "sometimes you have to say something that may not please the patient." Ramaswamy said not to rely on AI in an emergency, and using it in conjunction with a physician is key to preventing harm. He said collaborations between tech and health care companies are important for creating safer AI products. "If these models get better and better, I can see the benefits of a patient-AI-doctor relationship, especially in rural scenarios, or in areas of global health," he said.
[5]
Research Identifies Blind Spots in AI Medical Triage | Newswise
Newswise -- New York, NY [February 24, 2026] -- ChatGPT Health, a widely used consumer artificial intelligence (AI) tool that provides health guidance directly to the public -- including advice about how urgently to seek medical care -- may fail to direct users appropriately to emergency care in a significant number of serious cases, according to researchers at the Icahn School of Medicine at Mount Sinai. The study, fast-tracked in the February 23, 2026 online issue of Nature Medicine [https://doi.org/10.1038/s41591-026-04297-7], is the first independent safety evaluation of the large language model (LLM)-based tool since its January 2026 launch. It also identified serious concerns with the tool's suicide-crisis safeguards. "LLMs have become patients' first stop for medical advice -- but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm," says Isaac S. Kohane, MD, PhD, Chair, Department of Biomedical Informatics at Harvard Medical School, who was not involved with the research. "When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional." Within weeks of its release, ChatGPT Health's maker, OpenAI, reported that about 40 million people were using the tool daily to seek health information and guidance, including advice about whether to seek urgent or emergency care. At the same time, say the investigators, there was little independent evidence about how safe or reliable its advice actually was. "That gap motivated our study," says lead author Ashwin Ramaswamy, MD, Instructor of Urology at the Icahn School of Medicine at Mount Sinai. "We wanted to answer a very basic but critical question: if someone is experiencing a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency room?" With respect to suicide-risk alerts, ChatGPT Health was designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations. However, the investigators found that these alerts appeared inconsistently, sometimes triggering in lower-risk scenarios while -- alarmingly -- failing to appear when users described specific plans for self-harm. "This was a particularly surprising and concerning finding," says senior and co-corresponding study author Girish N. Nadkarni, MD, MPH, Barbara T. Murphy Chair of the Windreich Department of Artificial Intelligence and Human Health, Director of the Hasso Plattner Institute for Digital Health, and Irene and Dr. Arthur M. Fishberg Professor of Medicine at the Icahn School of Medicine at Mount Sinai, and Chief AI Officer of the Mount Sinai Health System. "While we expected some variability, what we observed went beyond inconsistency. The system's alerts were inverted relative to clinical risk, appearing more reliably for lower-risk scenarios than for cases when someone shared how they intended to hurt themselves. In real life, when someone talks about exactly how they would harm themselves, that's a sign of more immediate and serious danger, not less." As part of the evaluation, the research team created 60 structured clinical scenarios spanning 21 medical specialties. Cases ranged from minor conditions appropriate for home care to true medical emergencies. Three independent physicians determined the correct level of urgency for each case using guidelines from 56 medical societies. 
Each scenario was tested under 16 different contextual conditions, including variations in race, gender, social dynamics (such as someone minimizing symptoms), and barriers to care like lack of insurance or transportation. In total, the team conducted 960 interactions with ChatGPT Health and compared its recommendations with physician consensus. In testing the 60 realistic patient scenarios developed by physicians, the researchers found that while the tool generally handled clear-cut emergencies correctly, it under-triaged more than half of cases that physicians determined required emergency care. The investigators were also struck by how the system failed in emergency medical cases. The tool often demonstrated that it recognized dangerous findings in its own explanations, yet still reassured the patient. "ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions," says Dr. Ramaswamy. "But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most. In one asthma scenario, for example, the system identified early warning signs of respiratory failure in its explanation but still advised waiting rather than seeking emergency treatment." The study authors advise that for worsening or concerning symptoms, including chest pain, shortness of breath, severe allergic reactions, or changes in mental status, people should seek medical care directly rather than relying solely on chatbot guidance. In cases involving thoughts of self-harm, individuals should contact the 988 Suicide and Crisis Lifeline or go to an emergency department. Still, the researchers emphasize that the findings do not suggest consumers should abandon AI health tools altogether. "As a medical student training at a time when AI health tools are already in the hands of millions, I see them as technologies we must learn to integrate thoughtfully into care rather than substitutes for clinical judgment," says Alvira Tyagi, a first-year medical student at the Icahn School of Medicine at Mount Sinai and second author of the study. "These systems are changing quickly, so part of our training now must consider learning how to understand their outputs critically, identify where they fall short, and use them in ways that protect patients." The study assessed the system at a single point in time. Because AI models are frequently updated, performance may change over time, underscoring the need for independent evaluation, the researchers say. "Starting medical training alongside tools that are evolving in real time makes it clear that today's results are not set in stone," Ms. Tyagi says. "That reality calls for ongoing review to ensure that improvements in technology translate into safer care." The team plans to continue evaluating updated versions of ChatGPT Health and other consumer-facing AI tools, expanding future research into areas such as pediatric care, medication safety, and non-English-language use. The paper is titled "ChatGPT Health performance in a structured test of triage recommendations." The study's authors, as listed in the journal, are Ashwin Ramaswamy, MD, MPP; Alvira Tyagi, BA; Hannah Hugo, MD; Joy Jiang, PhD; Pushkala Jayaraman, PhD; Mateen Jangda, MSc; Alexis E. Te, MD; Steven A. Kaplan, MD; Joshua Lampert, MD; Robert Freeman, MSN, MS; Nicholas Gavin, MD, MBA; Ashutosh K. Tewari, MBBS, MCh; Ankit Sakhuja, MBBS MS; Bilal Naved, PhD; Alexander W. Charney, MD, PhD; Mahmud Omar, MD; Michael A. 
Gorin, MD; Eyal Klang, MD; Girish N. Nadkarni, MD, MPH. For more Mount Sinai artificial intelligence news, visit: https://icahn.mssm.edu/about/artificial-intelligence.

About Mount Sinai's Windreich Department of AI and Human Health

Led by Girish N. Nadkarni, MD, MPH -- an international authority on the safe, effective, and ethical use of AI in health care -- Mount Sinai's Windreich Department of AI and Human Health is the first of its kind at a U.S. medical school, pioneering transformative advancements at the intersection of artificial intelligence and human health. The Department is committed to leveraging AI in a responsible, effective, ethical, and safe manner to transform research, clinical care, education, and operations. By bringing together world-class AI expertise, cutting-edge infrastructure, and unparalleled computational power, the department is advancing breakthroughs in multi-scale, multimodal data integration while streamlining pathways for rapid testing and translation into practice. The Department benefits from dynamic collaborations across Mount Sinai, including with the Hasso Plattner Institute for Digital Health at Mount Sinai -- a partnership between the Hasso Plattner Institute for Digital Engineering in Potsdam, Germany, and the Mount Sinai Health System -- which complements its mission by advancing data-driven approaches to improve patient care and health outcomes. At the heart of this innovation is the renowned Icahn School of Medicine at Mount Sinai, which serves as a central hub for learning and collaboration. This unique integration enables dynamic partnerships across institutes, academic departments, hospitals, and outpatient centers, driving progress in disease prevention, improving treatments for complex illnesses, and elevating quality of life on a global scale. In 2024, the Department's innovative NutriScan AI application, developed by the Mount Sinai Health System Clinical Data Science team in partnership with Department faculty, earned Mount Sinai Health System the prestigious Hearst Health Prize. NutriScan is designed to facilitate faster identification and treatment of malnutrition in hospitalized patients. This machine learning tool improves malnutrition diagnosis rates and resource utilization, demonstrating the impactful application of AI in health care. For more information on Mount Sinai's Windreich Department of AI and Human Health, visit: ai.mssm.edu

About the Hasso Plattner Institute at Mount Sinai

At the Hasso Plattner Institute for Digital Health at Mount Sinai, the tools of data science, biomedical and digital engineering, and medical expertise are used to improve and extend lives. The Institute represents a collaboration between the Hasso Plattner Institute for Digital Engineering in Potsdam, Germany, and the Mount Sinai Health System. Girish Nadkarni, MD, MPH, who directs the Institute, and Professor Lothar Wieler, a globally recognized expert in public health and digital transformation, jointly oversee the partnership, driving innovations that positively impact patient lives while transforming how people think about personal health and health systems. The Hasso Plattner Institute for Digital Health at Mount Sinai receives generous support from the Hasso Plattner Foundation. Current research programs and machine learning efforts focus on improving the ability to diagnose and treat patients.
About the Icahn School of Medicine at Mount Sinai

The Icahn School of Medicine at Mount Sinai is internationally renowned for its outstanding research, educational, and clinical care programs. It is the sole academic partner for the seven member hospitals of the Mount Sinai Health System, one of the largest academic health systems in the United States, providing care to New York City's large and diverse patient population. The Icahn School of Medicine at Mount Sinai offers highly competitive MD, PhD, MD-PhD, and master's degree programs, with enrollment of more than 1,200 students. It has the largest graduate medical education program in the country, with more than 2,600 clinical residents and fellows training throughout the Health System. Its Graduate School of Biomedical Sciences offers 13 degree-granting programs, conducts innovative basic and translational research, and trains more than 560 postdoctoral research fellows. Ranked 11th nationwide in National Institutes of Health (NIH) funding, the Icahn School of Medicine at Mount Sinai is in the 99th percentile in research dollars per investigator according to the Association of American Medical Colleges. More than 4,500 scientists, educators, and clinicians work within and across dozens of academic departments and multidisciplinary institutes with an emphasis on translational research and therapeutics. Through Mount Sinai Innovation Partners (MSIP), the Health System facilitates the real-world application and commercialization of medical breakthroughs made at Mount Sinai.
[6]
What to know before asking an AI chatbot for health advice
WASHINGTON (AP) -- With hundreds of millions of people turning to chatbots for advice, it was only a matter of time before tech companies began offering programs specifically designed to answer health questions. In January, OpenAI introduced ChatGPT Health, a new version of its chatbot that the company says can analyze users' medical records, wellness apps and wearable device data to answer health and medical questions. Currently, there's a waiting list for the program. Anthropic, a rival AI company, offers similar features for some users of its Claude chatbot. Both companies say their programs, known as large language models, aren't a substitute for professional care and shouldn't be used to diagnose medical conditions. Instead, they say the chatbots can summarize and explain complicated test results, help prepare for a doctor's visit or analyze important health trends buried in medical records and app metrics. Here are some things to consider before talking to a chatbot about your health:

Chatbots can offer more personalized information than a Google search

Some doctors and researchers who have worked with ChatGPT Health and similar programs see them as an improvement over the status quo. AI platforms are not perfect -- they can sometimes hallucinate or provide bad advice -- but the information they produce is more likely to be personalized and specific than what patients might find through a Google search. "The alternative often is nothing, or the patient winging it," said Dr. Robert Wachter, a medical technology expert at University of California, San Francisco. "And so I think that if you use these tools responsibly, I think you can get useful information." One advantage of the latest chatbots is that they answer users' questions with context from their medical history, including prescriptions, age and doctor's notes. Even if you haven't given AI access to your medical information, Wachter and others recommend giving the chatbots as many details as possible to improve responses.

If you're having worrisome symptoms, skip AI

Wachter and others stress that there are situations when people should skip the chatbot and seek immediate medical attention. Symptoms such as shortness of breath, chest pain or a severe headache could signal a medical emergency. Even during less urgent situations, patients and doctors should approach AI programs with "a degree of healthy skepticism," said Dr. Lloyd Minor of Stanford University. "If you're talking about a major medical decision, or even a smaller decision about your health, you should never be relying just on what you're getting out of a large language model," said Minor, who is the dean of Stanford's medical school.

Consider your privacy before uploading any health data

Many benefits offered by AI bots stem from users sharing personal medical information. But it's important to understand that anything shared with an AI company isn't protected by the federal privacy law that normally governs sensitive medical information. Commonly known as HIPAA, the law allows for fines and even prison time for doctors, hospitals, insurers or other health services that disclose medical records. But the law doesn't apply to companies that design chatbots. "When someone is uploading their medical chart into a large language model, that is very different than handing it to a new doctor," said Minor. "Consumers need to understand that they're completely different privacy standards."
Both OpenAI and Anthropic say users' health information is kept separate from other types of data and is subject to additional privacy protections. The companies do not use health data to train their models. Users must opt in to share their information and can disconnect at any time.

Testing shows chatbots can stumble

Despite excitement surrounding AI, independent testing of the technology is in its infancy. Early studies suggest programs like ChatGPT can ace high-level medical exams but often stumble when interacting with humans. A 1,300-participant study by Oxford University recently found that people using AI chatbots to research hypothetical health conditions didn't make better decisions than people using online searches or personal judgment. AI chatbots presented with medical scenarios in a comprehensive, written form correctly identified the underlying condition 95% of the time. "That was not the problem," said lead author Adam Mahdi of the Oxford Internet Institute. "The place where things fell apart was during the interaction with the real participants." Mahdi and his team found several communication problems. People often didn't give the chatbots the necessary information to correctly identify the health issue. Conversely, the AI systems often responded with a combination of good and bad information, and users had trouble distinguishing between the two. The study, conducted in 2024, did not use the latest chatbot versions, including new offerings like ChatGPT Health.

A second AI opinion can be helpful

The ability for chatbots to ask follow-up questions and elicit key details from users is one area where Wachter sees room for improvement. "I think that's when this will get really good, when the tools become a little bit more doctor-ish in the way they go back and forth" with patients, Wachter said. For now, one way to feel more confident about the information you're getting is to consult multiple chatbots -- similar to getting a second opinion from another doctor. "I will sometimes put information into ChatGPT and information into Gemini," Wachter said, referencing Google's AI tool. "And when they both agree, I feel a little bit more secure that that's the right answer."

___

The Associated Press Health and Science Department receives support from the Howard Hughes Medical Institute's Department of Science Education and the Robert Wood Johnson Foundation. The AP is solely responsible for all content.
[7]
ChatGPT Health fails to spot 52% of medical emergencies in study
A study published in Nature Medicine on February 24 found that ChatGPT Health failed to direct users to emergency care in more than half of serious medical cases. Researchers at the Icahn School of Medicine at Mount Sinai conducted the evaluation, testing the consumer-facing tool across 960 interactions. The study highlights potential safety concerns regarding AI-powered triage as millions of users increasingly rely on chatbots for health guidance. The research team designed 60 clinical scenarios spanning 21 medical specialties. These cases ranged from minor conditions suitable for home care to genuine emergencies. Three independent physicians established the correct level of urgency for each scenario, utilizing guidelines from 56 medical societies. This consensus approach ensured a standardized benchmark for evaluating the AI's performance. Each scenario was then tested under 16 different contextual conditions, including variations in race, gender, social dynamics, and barriers to care such as lack of insurance. This methodology produced a total of 960 interactions with ChatGPT Health. The results revealed what the researchers described as an "inverted U-shaped" pattern of performance. ChatGPT Health handled textbook emergencies like stroke and anaphylaxis correctly. However, the tool under-triaged 52 percent of cases that physicians deemed true emergencies. For conditions such as diabetic ketoacidosis and impending respiratory failure, the AI directed patients toward a 24-to-48-hour evaluation instead of recommending immediate emergency department care. Additionally, the system misclassified 35 percent of non-urgent cases. A significant finding concerned the tool's susceptibility to anchoring bias. When family members or friends minimized symptoms within the prompts, triage recommendations shifted dramatically toward less urgent care. The study quantified this influence with an odds ratio of 11.7. Dr. Ashwin Ramaswamy, one of the study's corresponding authors, commented on the specific limitations observed. "ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions," Ramaswamy said. "But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most." The study also exposed inconsistencies in the tool's crisis intervention system. ChatGPT Health is designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations. Researchers found that these alerts appeared more reliably when users described no specific method of self-harm than when they articulated a concrete plan. This observation effectively inverted the relationship between risk level and safeguard activation. Dr. Girish Nadkarni, Mount Sinai's Chief AI Officer and the study's other corresponding author, described the finding as going "beyond inconsistency." Nadkarni noted that "the system's alerts were inverted relative to clinical risk." The study's publication coincides with rapid consumer adoption of AI health tools. OpenAI launched ChatGPT Health in January 2026. The company reported that roughly 40 million people were using ChatGPT daily for health-related questions. Earlier in 2026, the nonprofit patient safety organization ECRI ranked misuse of AI chatbots in healthcare as the top health technology hazard. ECRI warned that these tools "can provide false or misleading information that could result in significant patient harm." 
The Mount Sinai team analyzed the influence of demographic and socioeconomic factors on triage outcomes. The data showed no statistically detectable effects from patient race, gender, or barriers to care. However, the study's confidence intervals did not rule out the possibility of clinically meaningful differences. The researchers indicated plans to continue evaluating updated versions of ChatGPT Health and other consumer AI tools. Future research will expand into pediatric care, medication safety, and non-English-language use.
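For readers unfamiliar with the odds-ratio figure quoted above (11.7 for the anchoring effect), the sketch below shows how such a value and its Wald 95% confidence interval are computed from a 2x2 table. The counts are hypothetical, chosen only to land near the reported figure; the paper's underlying table is not reproduced in these articles.

    # How an odds ratio like the reported 11.7 is derived from a 2x2 table.
    # The counts below are hypothetical stand-ins, not the paper's data.
    import math

    # rows: "friend minimized symptoms" vs. control; cols: model downplayed or not
    a, b = 45, 15   # minimized: downplayed / not downplayed   (hypothetical)
    c, d = 12, 48   # control:   downplayed / not downplayed   (hypothetical)

    or_hat = (a * d) / (b * c)             # sample odds ratio
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of log(OR), Wald method
    lo = math.exp(math.log(or_hat) - 1.96 * se)
    hi = math.exp(math.log(or_hat) + 1.96 * se)
    print(f"OR = {or_hat:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")

A confidence interval this wide is also why, as the paragraph above notes, null demographic results do not rule out clinically meaningful differences.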
[8]
What to Know Before Asking an AI Chatbot for Health Advice
This image provided by OpenAI in February 2026 demonstrates a health chatbot on a phone app. (OpenAI via AP)

WASHINGTON (AP) -- With hundreds of millions of people turning to chatbots for advice, it was only a matter of time before tech companies began offering programs specifically designed to answer health questions. In January, OpenAI introduced ChatGPT Health, a new version of its chatbot that the company says can analyze users' medical records, wellness apps and wearable device data to answer health and medical questions. Currently, there's a waiting list for the program. Anthropic, a rival AI company, offers similar features for some users of its Claude chatbot. Both companies say their programs, known as large language models, aren't a substitute for professional care and shouldn't be used to diagnose medical conditions. Instead, they say the chatbots can summarize and explain complicated test results, help prepare for a doctor's visit or analyze important health trends buried in medical records and app metrics. Here are some things to consider before talking to a chatbot about your health:

Chatbots can offer more personalized information than a Google search

Some doctors and researchers who have worked with ChatGPT Health and similar programs see them as an improvement over the status quo. AI platforms are not perfect -- they can sometimes hallucinate or provide bad advice -- but the information they produce is more likely to be personalized and specific than what patients might find through a Google search. "The alternative often is nothing, or the patient winging it," said Dr. Robert Wachter, a medical technology expert at University of California, San Francisco. "And so I think that if you use these tools responsibly, I think you can get useful information." One advantage of the latest chatbots is that they answer users' questions with context from their medical history, including prescriptions, age and doctor's notes. Even if you haven't given AI access to your medical information, Wachter and others recommend giving the chatbots as many details as possible to improve responses.

If you're having worrisome symptoms, skip AI

Wachter and others stress that there are situations when people should skip the chatbot and seek immediate medical attention. Symptoms such as shortness of breath, chest pain or a severe headache could signal a medical emergency. Even during less urgent situations, patients and doctors should approach AI programs with "a degree of healthy skepticism," said Dr. Lloyd Minor of Stanford University. "If you're talking about a major medical decision, or even a smaller decision about your health, you should never be relying just on what you're getting out of a large language model," said Minor, who is the dean of Stanford's medical school.

Consider your privacy before uploading any health data

Many benefits offered by AI bots stem from users sharing personal medical information. But it's important to understand that anything shared with an AI company isn't protected by the federal privacy law that normally governs sensitive medical information. Commonly known as HIPAA, the law allows for fines and even prison time for doctors, hospitals, insurers or other health services that disclose medical records. But the law doesn't apply to companies that design chatbots. "When someone is uploading their medical chart into a large language model, that is very different than handing it to a new doctor," said Minor.
"Consumers need to understand that they're completely different privacy standards." Both OpenAI and Anthropic say users' health information is kept separate from other types of data and is subject to additional privacy protections. The companies do not use health data to train their models. Users must opt in to share their information and can disconnect at any time. Testing shows chatbots can stumble Despite excitement surrounding AI, independent testing of the technology is in its infancy. Early studies suggest programs like ChatGPT can ace high-level medical exams but often stumble when interacting with humans. A 1,300-participant study by Oxford University recently found that people using AI chatbots to research hypothetical health conditions didn't make better decisions than people using online searches or personal judgment. AI chatbots presented with medical scenarios in a comprehensive, written form correctly identified the underlying condition 95% of the time. "That was not the problem," said lead author Adam Mahdi of the Oxford Internet Institute. "The place where things fell apart was during the interaction with the real participants." Mahdi and his team found several communication problems. People often didn't give the chatbots the necessary information to correctly identify the health issue. Conversely, the AI systems often responded with a combination of good and bad information, and users had trouble distinguishing between the two. The study, conducted in 2024, did not use the latest chatbot versions, including new offerings like ChatGPT Health. A second AI opinion can be helpful The ability for chatbots to ask follow-up questions and elicit key details from users is one area where Wachter sees room for improvement. "I think that's when this will get really good, when the tools become a little bit more doctor-ish in the way they go back and forth" with patients, Wachter said. For now, one way to feel more confident about the information you're getting is to consult multiple chatbots -- similar to getting a second opinion from another doctor. "I will sometimes put information into ChatGPT and information into Gemini," Wachter said, referencing Google's AI tool. "And when they both agree, I feel a little bit more secure that that's the right answer." ___ The Associated Press Health and Science Department receives support from the Howard Hughes Medical Institute's Department of Science Education and the Robert Wood Johnson Foundation. The AP is solely responsible for all content.
[9]
Where ChatGPT Health fails -- and how it could turn deadly
OpenAI last month introduced ChatGPT Health, a dedicated space in ChatGPT that allows users to ask health questions, analyze their medical records and connect to wellness apps. Now, weeks after its launch, researchers from the Icahn School of Medicine at Mount Sinai are raising concerns that the AI tool often fails to recommend urgent care in emergency cases and sometimes misses suicide-crisis alerts. "ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions," Dr. Ashwin Ramaswamy, instructor of urology at the Icahn School of Medicine at Mount Sinai, said in a statement. "But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most." OpenAI, the maker of ChatGPT, said in January that more than 40 million people use ChatGPT every day to address their healthcare concerns. Thus, ChatGPT Health was born -- it was initially released to a small group of users and piqued the curiosity of Mount Sinai researchers. "We wanted to answer a very basic but critical question: if someone is experiencing a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency room?" Ramaswamy said. For his study, published this week in Nature Medicine, Ramaswamy's team devised 60 clinical scenarios spanning 21 medical specialties. Each scenario was tested 16 times, with conditions such as race, gender and lack of insurance changing each time to see if it led to a different outcome. In all, the researchers logged 960 interactions with ChatGPT Health. Its recommendations were compared to physician consensus. The study found that the tool failed to flag users to seek emergency care in 52% of serious cases. For example, ChatGPT Health identified early warning signs of respiratory failure in one asthma scenario, but suggested waiting instead of getting urgent treatment, Ramaswamy said. Alex Ruani, a doctoral researcher in health misinformation mitigation with University College London, called these inaccurate assessments "unbelievably dangerous." "If you're experiencing respiratory failure or diabetic ketoacidosis, you have a 50/50 chance of this AI telling you it's not a big deal," she told The Guardian. "What worries me most is the false sense of security these systems create. If someone is told to wait 48 hours during an asthma attack or diabetic crisis, that reassurance could cost them their life." ChatGPT Health also irregularly alerted users to the 988 Suicide and Crisis Lifeline in high-risk situations, according to the research. Senior and co-corresponding study author Dr. Girish N. Nadkarni called this a "particularly surprising and concerning finding." "While we expected some variability, what we observed went beyond inconsistency," said Nadkarni, chief AI officer of the Mount Sinai Health System. "The system's alerts were inverted relative to clinical risk, appearing more reliably for lower-risk scenarios than for cases when someone shared how they intended to hurt themselves," he added. "In real life, when someone talks about exactly how they would harm themselves, that's a sign of more immediate and serious danger, not less." ChatGPT and other chatbots have already been blamed in high-profile lawsuits for contributing to user suicides and mental health crises. The Post reached out to OpenAI for comment. 
A spokesperson told The Guardian that the study did not reflect real-life use of ChatGPT Health, a platform that's constantly updated and refined. The Mount Sinai doctors are not suggesting that users forgo AI health tools altogether, just that these systems should be closely monitored, independently evaluated and updated as needed. "We do believe that while there is a need for and a place for consumer-facing AI, there is potential for harm and thus an urgent need for independent evaluation and testing along with ongoing monitoring to establish failure modes and have engineering and human-centered safeguards to prevent adverse effects on people," Nadkarni and Ramaswamy told The Post. They plan to assess consumer-facing AI tools in areas such as pediatric care, medication safety and use by people who don't speak English.
OpenAI's ChatGPT Health missed over half of medical emergencies in a Nature Medicine study, directing patients to routine appointments instead of emergency rooms. With 40 million daily users seeking health guidance, the AI tool also showed alarming inconsistencies in suicide-crisis safeguards, triggering alerts for low-risk cases while failing to respond when users described specific self-harm plans.

ChatGPT Health, OpenAI's dedicated consumer AI tool for health guidance, failed to appropriately direct users to emergency care in more than half of serious medical emergencies, according to the first independent safety evaluation, published in Nature Medicine [1]. The study, conducted by researchers at the Mount Sinai Health System, tested 60 realistic patient scenarios spanning 21 medical specialties across 960 interactions and found that the tool under-triaged 51.6% of the cases that physicians determined required immediate emergency care [3]. Instead of recommending emergency room visits, ChatGPT Health directed patients experiencing life-threatening conditions such as diabetic ketoacidosis and respiratory failure to schedule routine appointments within 24 to 48 hours [4].

The stakes are extraordinarily high given that approximately 40 million people use the tool daily to seek health information, including advice about whether to seek urgent or emergency care [1]. Lead author Ashwin Ramaswamy, an Instructor of Urology at the Icahn School of Medicine, explained that the study aimed to answer a basic but critical question: "If someone is experiencing a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency room?" [5]
While ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions, correctly triaging these 100% of the time, it struggled significantly in more nuanced situations where the danger is not immediately obvious [4]. In one asthma scenario, the system identified early warning signs of respiratory failure in its own explanation but still advised waiting rather than seeking emergency treatment [1]. This paradoxical behavior reveals critical blind spots in clinical judgment: the large language model (LLM) recognizes dangerous findings yet still reassures the patient. Doctoral researcher Alex Ruani, who studies health misinformation mitigation at University College London, described the findings as "unbelievably dangerous," noting that in one simulation the platform sent a suffocating woman to a future appointment she wouldn't live to see in eight out of 10 attempts, an 84% failure rate [3]. Meanwhile, the chatbot also over-triaged 64.8% of nonurgent cases, recommending immediate medical care for completely safe individuals [2].
The independent safety evaluation revealed particularly concerning failures in the suicide-crisis safeguards designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations [1]. Researchers found that these alerts appeared inconsistently and were "inverted relative to clinical risk," appearing more reliably for lower-risk scenarios while failing to appear when users described specific plans for self-harm [5].

In one suicidal-ideation test, Ramaswamy described a 27-year-old patient who said he had been thinking about taking a lot of pills. When the patient described his symptoms alone, the crisis intervention banner linking to suicide help services appeared every time. When normal lab results were added to the same scenario, with identical words and severity, the banner vanished in zero out of 16 attempts [3]. Senior study author Girish N. Nadkarni, Chief AI Officer of the Mount Sinai Health System, noted that "when someone talks about exactly how they would harm themselves, that's a sign of more immediate and serious danger, not less" [1].
The research team created 60 structured clinical scenarios covering conditions from mild illnesses to true medical emergencies, and three independent physicians determined the correct level of urgency for each case using guidelines from 56 medical societies [5]. Each scenario was tested under 16 different contextual conditions, including variations in race, gender, social dynamics such as someone minimizing symptoms, and barriers to care like lack of insurance or transportation [1]. The variations were designed to produce the exact same triage recommendation regardless of demographic changes, and the study found no significant differences based on these factors [4].
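This design is essentially an invariance (metamorphic) test: every contextual variant of a scenario should map to the same consensus triage label. A minimal sketch of such a check, with hypothetical names for the model hook and variant list, since the study's pipeline is not public:

    # Invariance check: a scenario's triage label should not change across
    # contextual variants. query_model and the variant dicts are hypothetical.
    def find_deviations(query_model, scenario_id, variants, consensus_label):
        """Return every (variant, label) pair that departs from consensus."""
        return [
            (variant, label)
            for variant in variants
            if (label := query_model(scenario_id, variant)) != consensus_label
        ]

    # e.g. flag any of 16 variants where the label drifts from "emergency":
    # drift = find_deviations(ask_health_bot, "dka_vignette", variants, "emergency")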
Notably, the platform was nearly 12 times more likely to downplay symptoms when the "patient" mentioned that a "friend" in the scenario had suggested it was nothing serious [3]. This susceptibility to social influence demonstrates how the consumer AI tool can fail to recognize medical emergencies once contextual noise is introduced.
Isaac S. Kohane, Chair of the Department of Biomedical Informatics at Harvard Medical School, who was not involved with the research, emphasized that "LLMs have become patients' first stop for medical advice—but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm. When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional" [1].

Prof. Paul Henman, a digital sociologist at the University of Queensland, warned that if ChatGPT Health were used by people at home, "it could lead to higher numbers of unnecessary medical presentations for low-level conditions, and a failure of people to obtain urgent medical care when required, which could feasibly lead to unnecessary harm and death" [3]. He also raised concerns about legal liability, noting that a suite of legal cases against tech companies is already in motion relating to suicide and self-harm after using AI chatbots.

An OpenAI spokesperson told media outlets that while the company welcomed independent research evaluating AI systems in healthcare, the study did not reflect how people typically use ChatGPT Health in real life [3]. The spokesperson emphasized that the model is continuously updated and refined, and that the chatbot is designed for users to ask follow-up questions that provide more context rather than give a single response to a medical scenario [4].
ChatGPT Health is currently available only to a limited number of users on a waitlist, and OpenAI says it is working to improve the model's safety and reliability before a wider release. However, researchers and experts maintain that a plausible risk of harm is sufficient to justify stronger safeguards and independent oversight [3]. Dr. John Mafi, an associate professor of medicine at UCLA Health, stressed that "before you roll something like this out, to make life-affecting decisions, you need to rigorously test it in a controlled trial, where you're making sure that the benefits outweigh the harms" [4]. The study authors advise that for worsening or concerning symptoms, including chest pain, shortness of breath, severe allergic reactions, or changes in mental status, people should seek medical care directly rather than relying solely on AI tools.