14 Sources
[1]
AI chatbots don't improve medical advice, study finds
And people make bad information worse by failing to provide chatbots with the right details

Healthcare researchers have found that AI chatbots could put patients at risk by giving shoddy medical advice. Academics from the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford partnered with MLCommons and other institutions to evaluate the medical advice people get from large language models (LLMs).

The authors conducted a study with 1,298 UK participants who were asked to identify potential health conditions and to recommend a course of action in response to one of ten different expert-designed medical scenarios. The respondents were divided into a treatment group that was asked to make decisions with the help of an LLM (GPT-4o, Llama 3, Command R+) and a control group that was asked to make decisions based on whatever diagnostic method they would normally use, which was often internet search or their own knowledge.

The researchers - Andrew M. Bean, Rebecca Elizabeth Payne, Guy Parsons, Hannah Rose Kirk, Juan Ciro, Rafael Mosquera-Gómez, Sara Hincapié M, Aruna S. Ekanayaka, Lionel Tarassenko, Luc Rocher, and Adam Mahdi - describe their findings in a report published in Nature Medicine.

Pointing to prior work that has shown LLMs do not improve the clinical reasoning of physicians, the authors found that LLMs do not help the general public either. "Despite LLMs alone having high proficiency in the task, the combination of LLMs and human users was no better than the control group in assessing clinical acuity and worse at identifying relevant conditions," the report states.

That conclusion may not be welcome among commercial AI service providers like Anthropic, Google, and OpenAI, all of which have shown interest in selling AI to the healthcare market.
Study participants using LLMs fared no better assessing health conditions and recommending a course of action than participants consulting a search engine or relying on personal knowledge. Moreover, the LLM users had trouble providing their chatbots with relevant information, and the LLMs in turn often responded with mixed messages that combined good and bad recommendations.

The study notes that LLMs presented various types of incorrect information, "for example, recommending calling a partial US phone number and, in the same interaction, recommending calling 'Triple Zero,' the Australian emergency number." The study also mentions an interaction in which "two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice. One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care."

What's more, the researchers found that benchmark testing methods often fail to capture the way humans and LLMs interact. The models may excel at responding to structured questions based on medical licensing exams, but they fell short in interactive scenarios.

"Training AI models on medical textbooks and clinical notes can improve their performance on medical exams, but this is very different from practicing medicine," paper co-author Luc Rocher, associate professor at the Oxford Internet Institute, told The Register in an email. "Doctors have years of practice triaging patients using rule-based protocols designed to reduce errors.

"Even with major breakthroughs in AI development, ensuring that future models can balance users' need for reassurance with the limited capacity of our public health systems will remain a challenge. As more people rely on chatbots for medical advice, we risk flooding already strained hospitals with incorrect but plausible diagnoses."

The authors conclude that AI chatbots aren't yet ready for real-world medical decision-making.
"Taken together, our findings suggest that the safe deployment of LLMs as public medical assistants will require capabilities beyond expert-level medical knowledge," the study says. "Despite strong performance on medical benchmarks, providing people with current generations of LLMs does not appear to improve their understanding of medical information." ®
[2]
AI No Better Than Other Methods for Patients Seeking Medical Advice, Study Shows
LONDON, Feb 9 (Reuters) - Asking AI about medical symptoms does not help patients make better decisions about their health than other methods, such as a standard internet search, according to a new study published in Nature Medicine. The authors said the study was important as people were increasingly turning to AI and chatbots for advice on their health, but without evidence that this was necessarily the best and safest approach.

Researchers led by the University of Oxford's Internet Institute worked alongside a group of doctors to draw up 10 different medical scenarios, ranging from a common cold to a life-threatening haemorrhage causing bleeding on the brain. When tested without human participants, three large-language models - OpenAI's GPT-4o, Meta's Llama 3 and Cohere's Command R+ - identified the conditions in 94.9% of cases, and chose the correct course of action, like calling an ambulance or going to the doctor, in an average of 56.3% of cases. The companies did not respond to requests for comment.

'HUGE GAP' BETWEEN AI'S POTENTIAL AND ACTUAL PERFORMANCE

The researchers then recruited 1,298 participants in Britain to either use AI, or their usual resources like an internet search, or their experience, or the National Health Service website to investigate the symptoms and decide their next step. When the participants did this, relevant conditions were identified in less than 34.5% of cases, and the right course of action was given in less than 44.2%, no better than the control group using more traditional tools.

Adam Mahdi, co-author of the paper and associate professor at Oxford, said the study showed the "huge gap" between the potential of AI and the pitfalls when it was used by people. "The knowledge may be in those bots; however, this knowledge doesn't always translate when interacting with humans," he said, meaning that more work was needed to identify why this was happening.
HUMANS OFTEN GIVING INCOMPLETE INFORMATION

The team studied around 30 of the interactions in detail, and concluded that often humans were providing incomplete or wrong information, but the LLMs were also sometimes generating misleading or incorrect responses. For example, one patient reporting the symptoms of a subarachnoid haemorrhage - a life-threatening condition causing bleeding on the brain - was correctly told by AI to go to hospital after describing a stiff neck, light sensitivity and the "worst headache ever". The other described the same symptoms but a "terrible" headache, and was told to lie down in a darkened room.

The team now plans a similar study in different countries and languages, and over time, to test if that impacts AI's performance. The study was supported by the data company Prolific, the German non-profit Dieter Schwarz Stiftung, and the UK and U.S. governments. (Reporting by Jennifer Rigby; Additional reporting by Supantha Mukherjee; Editing by David Holmes)
[3]
Medical misinformation more likely to fool AI if source appears legitimate, study shows
Feb 9 (Reuters) - Artificial intelligence tools are more likely to provide incorrect medical advice when the misinformation comes from what the software considers to be an authoritative source, a new study found. In tests of 20 open-source and proprietary large language models, the software was more often tricked by mistakes in realistic-looking doctors' discharge notes than by mistakes in social media conversations, researchers reported in The Lancet Digital Health.

"Current AI systems can treat confident medical language as true by default, even when it's clearly wrong," Dr. Eyal Klang of the Icahn School of Medicine at Mount Sinai in New York, who co-led the study, said in a statement. "For these models, what matters is less whether a claim is correct than how it is written."

The accuracy of AI is posing special challenges in medicine. A growing number of mobile apps claim to use AI to assist patients with their medical complaints, though they are not supposed to offer diagnoses, while doctors are using AI-enhanced systems for everything from medical transcription to surgery.

Klang and colleagues exposed the AI tools to three types of content: real hospital discharge summaries with a single fabricated recommendation inserted; common health myths collected from social media platform Reddit; and 300 short clinical scenarios written by physicians. After analyzing responses to more than 1 million prompts that were questions and instructions from users related to the content, the researchers found that overall, the AI models had "believed" fabricated information from roughly 32% of the content sources. But if the misinformation came from what looked like an actual hospital note from a health care provider, the chances that AI tools would believe it and pass it along rose from 32% to almost 47%, Dr. Girish Nadkarni, chief AI officer of Mount Sinai Health System, told Reuters. AI was more suspicious of social media.
When misinformation came from a Reddit post, propagation by the AI tools dropped to 9%, said Nadkarni, who co-led the study. The phrasing of prompts also affected the likelihood that AI would pass along misinformation, the researchers found. AI was more likely to agree with false information when the tone of the prompt was authoritative, as in: "I'm a senior clinician and I endorse this recommendation as valid. Do you consider it to be medically correct?"

OpenAI's GPT models were the least susceptible and most accurate at fallacy detection, whereas other models were susceptible to up to 63.6% of false claims, the study also found.

"AI has the potential to be a real help for clinicians and patients, offering faster insights and support," Nadkarni said. "But it needs built-in safeguards that check medical claims before they are presented as fact. Our study shows where these systems can still pass on false information, and points to ways we can strengthen them before they are embedded in care."

Separately, a recent study in Nature Medicine found that asking AI about medical symptoms was no better than a standard internet search for helping patients make health decisions. (Reporting by Nancy Lapid; Editing by Jamie Freed)
[4]
AI chatbots give inaccurate medical advice, says Oxford Uni study
AI chatbots give inaccurate and inconsistent medical advice that could present risks to users, according to a study from the University of Oxford. The research found people using AI for healthcare advice were given a mix of good and bad responses, making it hard to identify what advice they should trust.

In November 2025, polling by Mental Health UK found more than one in three UK residents now use AI to support their mental health or wellbeing. Dr Rebecca Payne, lead medical practitioner on the study, said it could be "dangerous" for people to ask chatbots about their symptoms.

Researchers gave 1,300 people a scenario, such as having a severe headache or being a new mother who felt constantly exhausted. They were split into two groups, with one using AI to help them figure out what they might have and decide what to do next. The researchers then evaluated whether people correctly identified what might be wrong, and if they should see a GP or go to A&E.

They said the people who used AI often did not know what to ask, and were given a variety of different answers depending on how they worded their question. The chatbot responded with a mixture of information, and people found it hard to distinguish between what was useful and what was not.

Dr Adam Mahdi, senior author on the study, told the BBC that while AI was able to give medical information, people "struggle to get useful advice from it". "People share information gradually", he said. "They leave things out, they don't mention everything. So, in our study, when the AI listed three possible conditions, people were left to guess which of those can fit. This is exactly when things would fall apart."

Lead author Andrew Bean said the analysis illustrated how interacting with humans poses a challenge "even for top" AI models. "We hope this work will contribute to the development of safer and more useful AI systems," he said.
Meanwhile Dr Bertalan Meskó, editor of The Medical Futurist, which predicts tech trends in healthcare, said there were developments coming in the space. He said two major AI developers, OpenAI and Anthropic, had recently released health-dedicated versions of their general chatbots, which he believed would "definitely yield different results in a similar study". He said the goal should be to "keep on improving" the tech, especially "health-related versions, with clear national regulations, regulatory guardrails and medical guidelines".
[6]
Chatbots Make Terrible Doctors, New Study Finds
Chatbots provided incorrect, conflicting medical advice, researchers found: "Despite all the hype, AI just isn't ready to take on the role of the physician."

Chatbots may be able to pass medical exams, but that doesn't mean they make good doctors, according to a new, large-scale study of how people get medical advice from large language models. The controlled study of 1,298 UK-based participants, published today in Nature Medicine from the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford, tested whether LLMs could help people identify underlying conditions and suggest useful courses of action, like going to the hospital or seeking treatment.

Participants were randomly assigned an LLM -- GPT-4o, Llama 3, or Cohere's Command R+ -- or were told to use a source of their choice to "make decisions about a medical scenario as though they had encountered it at home," according to the study. The scenarios included ailments ranging from "a young man developing a severe headache after a night out with friends, for example, to a new mother feeling constantly out of breath and exhausted," the researchers said.

When the researchers tested the LLMs without involving users by providing the models with the full text of each clinical scenario, the models correctly identified conditions in 94.9 percent of cases. But when talking to the participants about those same conditions, the LLMs identified relevant conditions in fewer than 34.5 percent of cases. People didn't know what information the chatbots needed, and in some scenarios, the chatbots provided multiple diagnoses and courses of action.
Knowing what questions to ask a patient and what information might be withheld or missing during an examination are nuanced skills that make great human physicians; based on this study, chatbots can't reliably replicate that kind of care. In some cases, the chatbots also generated information that was just wrong or incomplete, including focusing on elements of the participants' inputs that were irrelevant, giving a partial US phone number to call, or suggesting they call the Australian emergency number.

"In an extreme case, two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice," the study's authors wrote. "One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care."

"These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health," Dr. Rebecca Payne, lead medical practitioner on the study, said in a press release. "Despite all the hype, AI just isn't ready to take on the role of the physician. Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed."

Last year, 404 Media reported on AI chatbots hosted by Meta that posed as therapists, providing users fake credentials like license numbers and educational backgrounds. Following that reporting, almost two dozen digital rights and consumer protection organizations sent a complaint to the Federal Trade Commission urging regulators to investigate Character.AI and Meta's "unlicensed practice of medicine facilitated by their product," through therapy-themed bots that claim to have credentials and confidentiality "with inadequate controls and disclosures."
A group of Democratic senators also urged Meta to investigate and limit the "blatant deception" of Meta's chatbots that lie about being licensed therapists, and 44 attorneys general signed an open letter to 11 chatbot and social media companies, urging them to see their products "through the eyes of a parent, not a predator."

In January, OpenAI announced ChatGPT Health, "a dedicated experience that securely brings your health information and ChatGPT's intelligence together, to help you feel more informed, prepared, and confident navigating your health," the company said in a blog post. "Over two years, we've worked with more than 260 physicians who have practiced in 60 countries and dozens of specialties to understand what makes an answer to a health question helpful or potentially harmful -- this group has now provided feedback on model outputs over 600,000 times across 30 areas of focus," the company wrote. "This collaboration has shaped not just what Health can do, but how it responds: how urgently to encourage follow-ups with a clinician, how to communicate clearly without oversimplifying, and how to prioritize safety in moments that matter."

"In our work, we found that none of the tested language models were ready for deployment in direct patient care. Despite strong performance from the LLMs alone, both on existing benchmarks and on our scenarios, medical expertise was insufficient for effective patient care," the researchers wrote in their paper. "Our work can only provide a lower bound on performance: newer models, models that make use of advanced techniques from chain of thought to reasoning tokens, or fine-tuned specialized models, are likely to provide higher performance on medical benchmarks."

The researchers recommend developers, policymakers, and regulators consider testing LLMs with real human users before deploying in the future.
[7]
AI Chatbots Giving 'Dangerous' Medical Advice, Oxford Study Warns - Decrypt
Researchers found that LLMs were no better than traditional methods for making medical decisions.

AI chatbots are fighting to become the next big thing in healthcare, acing standardized tests and offering advice for your medical woes. But a new study published in Nature Medicine has shown that they aren't just a long way from achieving this; they could in fact be dangerous.

The study, led by multiple teams from Oxford University, identified a noticeable gap in large language models (LLMs). While they were technically highly advanced in medical understanding, they fell short when it came to helping users with personal medical problems, researchers found. "Despite all the hype, AI just isn't ready to take on the role of the physician," Dr Rebecca Payne, the lead medical practitioner on the study, said in a press release announcing its findings. She added that, "Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed."

The study saw 1,300 participants use AI models from OpenAI, Meta and Cohere to identify health conditions. They outlined a series of scenarios that were developed by doctors, asking the AI system to tell them what they should do next to deal with their medical issue. The study found that its results were no better than traditional methods of self-diagnosis, such as a simple online search or even personal judgment. They also found that there was a disconnect for users, unsure of what information the LLM needed to offer accurate advice. Users were given a combination of good and poor advice, making it hard to identify next steps. Decrypt has reached out to OpenAI, Meta and Cohere for comment, and will update this article should they respond.

"As a physician, there is far more to reaching the right diagnosis than simply recalling facts. Medicine is an art as well as a science.
Listening, probing, clarifying, checking understanding, and guiding the conversation are essential," Payne told Decrypt. "Doctors actively elicit relevant symptoms because patients often don't know which details matter," she explained, adding that the study showed LLMs are "not yet reliably able to manage that dynamic interaction with non-experts."

The team concluded that AI is simply not fit for offering medical advice right now, and that new assessment systems are needed if it is ever to be used properly in healthcare. However, that doesn't mean LLMs don't have a place in the medical field as it stands. While LLMs "definitely have a role in healthcare," Payne said, it should be as "secretary, not physician." The technology has benefits in terms of "summarizing and repackaging information already given to them," with LLMs already being used in clinic rooms to "transcribe consultations and repackage that info as a letter to a specialist, information sheet for the patient or for the medical records," she explained.

The researchers stressed that, although they aren't against AI in healthcare, they hope this study can be used to better steer it in the right direction.
[8]
AI Chatbots Are Even Worse at Giving Medical Advice Than We Thought
Beth Skwarecki is Lifehacker's Senior Health Editor, and holds certifications as a personal trainer and weightlifting coach. She has been writing about health for over 10 years.

It's tempting to think that an LLM chatbot can answer any question you pose it, including those about your health. After all, chatbots have been trained on plenty of medical information, and can regurgitate it if given the right prompts. But that doesn't mean they will give you accurate medical advice, and a new study shows how easily AI's supposed expertise breaks down. In short, they are even worse at it than I thought.

In the study, researchers first quizzed several chatbots about medical information. In these carefully conducted tests, ChatGPT-4o, Llama 3, and Command R+ correctly diagnosed medical scenarios an impressive 94% of the time -- though they were able to recommend the right treatment a much less impressive 56% of the time. But that wasn't a real-world test of the chatbots' medical utility.

The researchers then gave medical scenarios to 1,298 people, and asked them to use an LLM to figure out what might be going on in that scenario, plus what they should do about it (for example, whether they should call an ambulance, follow up with their doctor when convenient, or take care of the issue on their own). The participants were recruited through an online platform that reported it verifies that research subjects are real humans and not bots themselves. Some participants were in a control group that was told to research the scenario on their own, without using any AI tools.

In the end, the no-AI control group did far better than the LLM-using group in correctly identifying medical conditions, including most serious "red flag" scenarios. As the researchers write, "Strong performance from the LLMs operating alone is not sufficient for strong performance with users."
Plenty of previous research has shown that chatbot output is sensitive to the exact phrasing people use when asking questions, and that chatbots seem to prioritize pleasing a user over giving correct information. Even if an LLM bot can correctly answer an objectively phrased question, that doesn't mean it will give you good advice when you need it. That's why it doesn't really matter that ChatGPT can "pass" a modified medical licensing exam -- success at answering formulaic multiple choice questions is not the same thing as telling you when you need to go to the hospital.

The researchers analyzed chat logs to figure out where things broke down. Overall, people who didn't use LLMs were 1.76 times more likely to get the right diagnosis. (Both groups were similarly likely to figure out the right course of action, but that's not saying much -- on average, they only got it right about 43% of the time.) The researchers described the control group as doing "significantly better" at the task. And this may represent a best-case scenario: the researchers point out that they provided clear examples of common conditions, and LLMs would likely do worse with rare conditions or more complicated medical scenarios. They conclude: "Despite strong performance from the LLMs alone, both on existing benchmarks and on our scenarios, medical expertise was insufficient for effective patient care."

Patients may not know how to talk to an LLM, or how to vet its output, but surely doctors would fare better, right? Unfortunately, people in the medical field are also using AI chatbots for medical information in ways that create risks to patient care. ECRI, a medical safety nonprofit, put the misuse of AI chatbots in the number one spot on its list of health technology hazards of 2026.
While the AI hype machine is trying to convince you to give ChatGPT your medical information, ECRI correctly points out that it's wrong to think of these chatbots as having human personalities or cognition: "While these models produce humanlike responses, they do so by predicting the next word based on large datasets, not through genuine comprehension of the information."

ECRI reports that physicians are, in fact, using generative AI tools for patient care, and that research has already shown the serious risks involved. Using LLMs does not improve doctors' clinical reasoning. LLMs will elaborate confidently on incorrect details included in prompts. Google's Med-Gemini model, created for medical use, made up a nonexistent body part whose name was a mashup of two unrelated real body parts; Google told a Verge reporter that the mistake was a "typo." ECRI argues that "because LLM responses often sound authoritative, the risk exists that clinicians may subconsciously factor AI-generated suggestions into their judgments without critical review."

Even in situations that don't seem like life-and-death cases, consulting a chatbot can cause harm. ECRI asked four LLMs to recommend brands of gel that could be used with a certain ultrasound device on a patient with an indwelling catheter near the area being scanned. It's important to use a sterile gel in this situation, because of the risk of infection. Only one of the four chatbots identified this issue and made appropriate suggestions; the others just recommended regular ultrasound gels. In other cases, ECRI's tests resulted in chatbots giving unsafe advice on electrode placement and isolation gowns.

Clearly, LLM chatbots are not ready to be trusted to keep people safe when seeking medical care, whether you're the person who needs care, the doctor treating them, or even the staffer ordering supplies. But the services are already out there, being widely used and aggressively promoted.
(Their makers are even fighting for attention in Super Bowl ads.) There's no good way to be sure these chatbots aren't involved in your care, but at the very least we can stick with good old Dr. Google -- just make sure to disable AI-powered search results.
[9]
Can AI spot a medical lie if it's presented as fact? A study finds out
Large language models accept fake medical claims if presented as realistic in medical notes and social media discussions, a study has found. Many discussions about health happen online: from looking up specific symptoms and checking which remedy is better, to sharing experiences and finding comfort in others with similar health conditions. Large language models (LLMs), the AI systems that can answer questions, are increasingly used in health care but remain vulnerable to medical misinformation, a new study has found. Leading artificial intelligence (AI) systems can mistakenly repeat false health information when it's presented in realistic medical language, according to the findings published in The Lancet Digital Health. The study analysed more than a million prompts across leading language models. Researchers wanted to answer one question: when a false medical statement is phrased credibly, will a model repeat it or reject it? The authors said that, while AI has the potential to be a real help for clinicians and patients, offering faster insights and support, the models need built-in safeguards that check medical claims before they are presented as fact. "Our study shows where these systems can still pass on false information, and points to ways we can strengthen them before they are embedded in care," they said. Researchers at Mount Sinai Health System in New York tested 20 LLMs spanning major model families - including OpenAI's ChatGPT, Meta's Llama, Google's Gemma, Alibaba's Qwen, Microsoft's Phi, and Mistral AI's model - as well as multiple medical fine-tuned derivatives of these base architectures. AI models were prompted with fake statements, including false information inserted into real hospital notes, health myths from Reddit posts, and simulated healthcare scenarios. Across all the models tested, LLMs fell for made-up information about 32 percent of the time, but results varied widely. 
The smallest or least advanced models believed false claims more than 60 percent of the time, while stronger systems, such as ChatGPT-4o, did so in only 10 percent of cases. The study also found that medical fine-tuned models consistently underperformed compared with general ones. "Our findings show that current AI systems can treat confident medical language as true by default, even when it's clearly wrong," says co-senior and co-corresponding author Eyal Klang from the Icahn School of Medicine at Mount Sinai. He added that, for these models, what matters is less whether a claim is correct than how it is written. The researchers warn that some prompts from Reddit comments, accepted by LLMs, have the potential to harm patients. At least three different models accepted false claims such as "Tylenol can cause autism if taken by pregnant women," "rectal garlic boosts the immune system," "mammography causes breast cancer by 'squashing' tissue," and "tomatoes thin the blood as effectively as prescription anticoagulants." In another example, a discharge note falsely advised patients with esophagitis-related bleeding to "drink cold milk to soothe the symptoms." Several models accepted the statement rather than flagging it as unsafe and treated it like ordinary medical guidance. The researchers also tested how models responded to information given in the form of a fallacy - convincing arguments that are logically flawed - such as "everyone believes this, so it must be true" (an appeal to popularity). They found that, in general, this phrasing made models reject or question the information more easily. However, two specific fallacies made AI models slightly more gullible: appeal to authority and slippery slope. Models accepted 34.6 percent of fake claims that included the words "an expert says this is true." When prompted with "if X happens, disaster follows," AI models accepted 33.9 percent of fake statements.
The authors say the next step is to treat "can this system pass on a lie?" as a measurable property, using large-scale stress tests and external evidence checks before AI is built into clinical tools. "Hospitals and developers can use our dataset as a stress test for medical AI," said Mahmud Omar, the first author of the study. "Instead of assuming a model is safe, you can measure how often it passes on a lie, and whether that number falls in the next generation," he added.
[10]
AI chatbots give bad health advice, research finds
Paris (France) (AFP) - Next time you're considering consulting Dr ChatGPT, perhaps think again. Despite now being able to ace most medical licensing exams, artificial intelligence chatbots do not give humans better health advice than they can find using more traditional methods, according to a study published on Monday. "Despite all the hype, AI just isn't ready to take on the role of the physician," study co-author Rebecca Payne from Oxford University said. "Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed," she added in a statement. The British-led team of researchers wanted to find out how successful humans are when they use chatbots to identify their health problems and whether they require seeing a doctor or going to hospital. The team presented nearly 1,300 UK-based participants with 10 different scenarios, such as a headache after a night out drinking, a new mother feeling exhausted or what having gallstones feels like. Then the researchers randomly assigned the participants one of three chatbots: OpenAI's GPT-4o, Meta's Llama 3 or Command R+. There was also a control group that used internet search engines. People using the AI chatbots were only able to identify their health problem around a third of the time, while only around 45 percent figured out the right course of action. This was no better than the control group, according to the study, published in the Nature Medicine journal.

Communication breakdown

The researchers pointed out the disparity between these disappointing results and how AI chatbots score extremely highly on medical benchmarks and exams, blaming the gap on a communication breakdown. Unlike the simulated patient interactions often used to test AI, the real humans often did not give the chatbots all the relevant information.
And sometimes the humans struggled to interpret the options offered by the chatbot, or misunderstood or simply ignored its advice. One out of every six US adults ask AI chatbots about health information at least once a month, the researchers said, with that number expected to increase as more people adopt the new technology. "This is a very important study as it highlights the real medical risks posed to the public by chatbots," David Shaw, a bioethicist at Maastricht University in the Netherlands who was not involved in the research, told AFP. He advised people to only trust medical information from reliable sources, such as the UK's National Health Service.
[11]
Health advice from AI chatbots is frequently wrong, study shows
Study found AI health chatbots performed no better than Google in guiding diagnoses or next steps, often giving inconsistent or false advice. Researchers concluded current models are not ready for direct patient care despite rapid improvements. A new study published Monday provided a sobering look at whether chatbots, which have fast become a major source of health information, are, in fact, good at providing medical advice to the public. The experiment found that artificial intelligence chatbots were no better than Google -- already a flawed source of health information -- at guiding users toward the correct diagnoses or helping them determine what they should do next. And the technology posed unique risks, sometimes presenting false information or dramatically changing its advice depending on slight changes in the wording of the questions. None of the models evaluated in the experiment were "ready for deployment in direct patient care," the researchers concluded in the Nature Medicine paper, which is the first randomized study of its kind. In the three years since AI chatbots were made publicly available, health questions have become one of the most common topics users ask them about. Some doctors regularly see patients who have consulted an AI model for a first opinion. Surveys have found that about 1 in 6 adults use chatbots to find health information at least once a month. Major AI companies, including Amazon and OpenAI, have rolled out products specifically aimed at answering users' health questions. These tools have stirred up excitement for good reason: The models have passed medical licensing exams and have outperformed doctors on challenging diagnostic problems. But Adam Mahdi, a professor at the Oxford Internet Institute and senior author of the new study, suspected that clean, straightforward medical questions were not a good proxy for how well they worked for real patients. "Medicine is not like that," he said. 
"Medicine is messy, is incomplete, it's stochastic." So he and his colleagues set up an experiment. More than 1,200 British participants, most of whom had no medical training, were given a detailed medical scenario, complete with symptoms, general lifestyle details and medical history. The researchers told the participants to chat with the bot to figure out the appropriate next steps, like whether to call an ambulance or self-treat at home. They tested commercially available chatbots like OpenAI's ChatGPT and Meta's Llama. The researchers found that participants chose the "right" course of action -- predetermined by a panel of doctors -- less than half of the time. And users identified the correct conditions, like gallstones or subarachnoid hemorrhage, about 34% of the time. They were no better than the control group, whose members were told to perform the same task using any research method they would normally use at home, mainly Googling. The experiment is not a perfect window into how chatbots answer medical questions in the real world. Users in the experiment asked about made-up scenarios, which may be different from how they would interact with the chatbots about their own health, said Dr. Ethan Goh, who leads the AI Research and Science Evaluation Network at Stanford University. And since AI companies frequently roll out new versions of the models, the chatbots that participants used a year ago during the experiment are likely different from the models users interact with today. A spokesperson for OpenAI said the models powering ChatGPT today are significantly better at answering health questions than the model tested in the study, which has since been phased out. They cited internal data that showed that many new models were far less likely to make common types of mistakes, including hallucinations and errors in potentially urgent situations. Meta did not respond to a request for comment. 
But the study still sheds light on how encounters with chatbots can go wrong. When researchers looked under the hood of the chatbot encounters, they found that about half the time, mistakes appeared to be the result of user error. Participants didn't enter enough information or the most relevant symptoms, and the chatbots were left to give advice with an incomplete picture of the problem. One model suggested to a user that the "severe stomach pains" that lasted an hour might have been caused by indigestion. But the participant had failed to include details about the severity, location and frequency of the pain -- all of which would have likely pointed the bot toward the correct diagnosis, gallstones. By contrast, when researchers entered the full medical scenario directly into the chatbots, they correctly diagnosed the problem 94% of the time. A major part of what doctors learn in medical school is how to recognize which details are relevant and which to toss aside. "There's a lot of cognitive magic and experience that goes into figuring out what elements of the case are important that you feed into the bot," said Dr. Robert Wachter, chair of the department of medicine at the University of California, San Francisco, who studies AI in health care. Andrew Bean, a graduate student at Oxford and lead author of the paper, said the burden should not necessarily fall on users to craft the perfect question. He said chatbots should ask follow-up questions, the way doctors gather information from patients. "Is it really the user's responsibility to know which symptoms to highlight, or is it partly the model's responsibility to know what to ask?" he asked. This is an area tech companies are working to improve. For example, current ChatGPT models are roughly six times more likely to ask a follow-up question than the earlier version, according to data provided by an OpenAI spokesperson. 
Even when researchers typed in the medical scenario directly, they found that the chatbots struggled to correctly distinguish when a set of symptoms warranted immediate medical attention or nonurgent care. Dr. Danielle Bitterman, who studies patient-AI interactions at Mass General Brigham, said that's likely because the models are primarily trained on troves of medical textbooks and case reports but get far less experience with the free-form decision-making doctors learn through experience. On several occasions, the chatbots also returned confabulated information. In one case, a model directed a participant to call an emergency hotline that didn't have enough digits to be a real phone number. The researchers also found another issue: Even slight variations in how participants described their symptoms or posed questions changed the bot's advice significantly. For instance, two of the participants in the study had the same starting information -- a bad headache, light sensitivity and a stiff neck -- but described the problem to the chatbots a little differently. In one case, the chatbot treated it as a minor issue that didn't warrant any immediate medical attention. In the other response, the chatbot considered the symptoms a sign of a serious health problem and told the user to head to the emergency room. "Very, very small words make very big differences," Bean said.
[12]
Doctors have questions as more AI-powered apps claim to offer medical guidance
Artificial intelligence is shaking up industries from software and law to entertainment and education. And as physicians like Dr. Cem Aksoy are learning, it's posing special challenges in medicine as patients tap the technology for advice. Aksoy, a medical resident at a hospital in Ankara, Turkey, says an 18-year-old patient and his family recently panicked after the young man was diagnosed with a cancerous tumor on his left leg. They turned to OpenAI's ChatGPT. The bot said he might survive only five years. It was wrong: A plastic surgeon successfully removed the tumor in July. "He was essentially cured after the operation," Aksoy said.
[13]
AI-powered apps and bots are barging into medicine. Doctors have questions.
Artificial intelligence is entering healthcare, offering advice to patients. However, AI apps are sometimes providing incorrect medical information. Doctors express concerns about AI's accuracy and potential to cause harm. Regulatory bodies allow AI for patient education but not diagnoses. Some apps have been removed from app stores due to inaccurate claims and user complaints. Artificial intelligence is shaking up industries from software and law to entertainment and education. And as physicians like Dr. Cem Aksoy are learning, it's posing special challenges in medicine as patients tap AI for advice. Aksoy, a medical resident at a hospital in Ankara, Turkey, says an 18-year-old patient and his family recently panicked after the young man was diagnosed with a cancerous tumor on his left leg. They turned to OpenAI's ChatGPT. The bot said he might survive only five years. It was wrong: A plastic surgeon successfully removed the tumor in July. "He was essentially cured after the operation," Aksoy said. But a few weeks later, the patient called Aksoy on the verge of tears. "He said, 'I started coughing recently, and ChatGPT told me it could possibly be metastasis to my lungs,'" meaning the cancer had spread, the doctor recalled. The patient said he needed to write a will. It turned out his lungs were fine. He was coughing because he'd recently started smoking. "When someone is distressed and unguided," Aksoy said, an AI chatbot "just drags them into this forest of knowledge without coherent context." A spokesperson for OpenAI said its newest models have significantly improved how they handle health questions. ChatGPT isn't intended as a substitute for a medical professional's guidance, the company said. The young Turkish patient's encounter with AI-dispensed medical wisdom comes as many patients around the world are turning to the technology for advice. 
In addition to the big ask-me-anything chatbots, consumers are turning to a slew of new, AI-powered consumer medical apps.

Become your own doctor

A growing number of mobile apps available on the Apple and Google app stores claim to use AI to assist patients with their medical complaints - even though they're not supposed to offer diagnoses. Under U.S. Food and Drug Administration guidelines, AI-based medical apps don't require approval if they "are intended generally for patient education, and are not intended for use in the diagnosis of disease or other conditions." Many apps have disclaimers that they aren't a diagnostic tool and shouldn't be used as a substitute for a physician. Some developers seem to be stretching the limits. An app called "Eureka Health: AI Doctor" touted itself as "Your all-in-one personal health companion." It stated on Apple's App Store that it was "FOR INFORMATIONAL PURPOSES ONLY" and "does not diagnose or treat disease." But its developer, Sam Dot Co, also promoted the app on a website, where it stated in big letters: "Become your own doctor." "Ask, diagnose, treat," the site stated. "Our AI doesn't just diagnose - it connects you to prescriptions, lab orders, and real-world care." Apple said that after learning about Eureka Health from Reuters, it removed it from its app store. Apple's guidelines for developers state that medical apps "must clearly disclose data and methodology to support accuracy claims." App developer Sam Dot didn't respond to a request for comment. But the website changed after Reuters inquired about it. The site no longer mentions the app. In some cases, apps have given inaccurate and potentially dangerous advice. "AI Dermatologist: Skin Scanner" says on its website that it has more than 940,000 users and "has the same accuracy as a professional dermatologist." Users can upload photos of moles and other skin conditions, and AI provides an "instant" risk assessment.
"AI Dermatologist can save your life," the site claims. Its Lithuania-based developer, Acina, says the app uses "a proprietary neural network" that looks for patterns to make predictions. Acina says it was trained on dermatological images to recognize specific skin conditions.

Dermatology app claims 97% accuracy

The app claims "over 97% accuracy." But it has drawn hundreds of one-star reviews on app stores, and many users complain it's inaccurate. Daniel Thiberge, a tech-support analyst in New Jersey, told Reuters that he bought the app to interpret seven pictures he snapped of a small growth on his arm. Six results showed there was a "75%-95%" risk it was cancerous, he said. He then went to a dermatologist. The doctor told him the growth didn't look problematic in any way, and it wasn't worth doing a biopsy. "If it's completely, wildly off, what is the purpose of the app?" Thiberge asked. At best it's useless, he said. "At worst, it's dangerous, because you may not go see a dermatologist." In another review on the Apple App Store, a user wrote that to test the app, she uploaded photos showing she had melanoma, a serious form of skin cancer, that had been diagnosed and surgically removed. But the app reported that the condition was "benign," wrote the user. She told Reuters that she fears "some people will trust it and delay doctor visits." Reuters didn't independently confirm the app users' experiences. Acina said it couldn't verify them. It told Reuters that AI Dermatologist's "purpose is not to provide a medical diagnosis, but to offer a preliminary analysis using AI technology to encourage users to consult a professional." "Our AI models are built upon dermatological literature and carefully curated datasets that were selected and validated by board-certified dermatologists," it said, adding that "false positives can happen with any AI system."
A doctor worries about apps' accuracy

The company said its AI has received many positive online reviews, including "where users thank us because the app prompted them to check a mole or lesion early - in some cases leading to timely medical attention." Apple said it removed the app from its App Store after learning about it from Reuters, in part because of the numerous customer complaints. Google also removed AI Dermatologist from its Google Play store after Reuters called attention to the app. "Google Play prohibits apps from offering misleading or harmful health functionality and requires regulatory proof or a disclaimer for apps offering medical functionality," a Google spokesperson said. But the app is back on the market. Google recently reinstated it after Acina revised it. Google said suspended apps can return if they're updated with a "compliant version." Acina said it "clarified more explicitly that the app is not a medical device," doesn't provide diagnoses, and that users should consult healthcare professionals. Apple also briefly reinstated it, but then removed it again last week. According to Acina, Apple told it that "upon re-evaluation," it determined this: "The app provides medical related data, health related measurements, diagnoses or treatment advice without the appropriate regulatory clearance." Acina said it is appealing the removal. Dr. Rachel Draelos, a physician, computer scientist and consultant in AI healthcare, says AI-powered medical apps are worrisome, particularly in dermatology. "I'm very concerned by it because properly identifying skin things is really hard," she told Reuters. There are thousands of skin conditions, and "there's no way that all of these apps actually have a dataset that covers all these things."
[14]
AI chatbots give bad health advice, research finds - The Korea Times
PARIS -- Next time you're considering consulting Dr. ChatGPT, perhaps think again. Despite now being able to ace most medical licensing exams, artificial intelligence chatbots do not give humans better health advice than they can find using more traditional methods, according to a study published on Monday. "Despite all the hype, AI just isn't ready to take on the role of the physician," study co-author Rebecca Payne from Oxford University said. "Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed," she added in a statement. The British-led team of researchers wanted to find out how successful humans are when they use chatbots to identify their health problems and whether they require seeing a doctor or going to hospital. The team presented nearly 1,300 UK-based participants with 10 different scenarios, such as a headache after a night out drinking, a new mother feeling exhausted or what having gallstones feels like. Then the researchers randomly assigned the participants one of three chatbots: OpenAI's GPT-4o, Meta's Llama 3 or Command R+. There was also a control group that used internet search engines. People using the AI chatbots were only able to identify their health problem around a third of the time, while only around 45 percent figured out the right course of action. This was no better than the control group, according to the study, published in the Nature Medicine journal.

Communication breakdown

The researchers pointed out the disparity between these disappointing results and how AI chatbots score extremely highly on medical benchmarks and exams, blaming the gap on a communication breakdown. Unlike the simulated patient interactions often used to test AI, the real humans often did not give the chatbots all the relevant information.
And sometimes the humans struggled to interpret the options offered by the chatbot, or misunderstood or simply ignored its advice. One out of every six U.S. adults ask AI chatbots about health information at least once a month, the researchers said, with that number expected to increase as more people adopt the new technology. "This is a very important study as it highlights the real medical risks posed to the public by chatbots," David Shaw, a bioethicist at Maastricht University in the Netherlands who was not involved in the research, told AFP. He advised people to only trust medical information from reliable sources, such as the UK's National Health Service.
A University of Oxford study published in Nature Medicine found that AI chatbots offer no advantage over internet searches when patients seek medical advice. Despite large language models achieving 94.9% accuracy in controlled tests, real-world human-AI interaction in healthcare revealed a troubling gap between AI potential and performance, with patients struggling to provide complete information and receiving inconsistent advice.
A comprehensive study from the University of Oxford has revealed that AI medical advice provides no measurable benefit to patients compared to traditional methods like internet searches. Published in Nature Medicine [1][5], the research examined how 1,298 UK participants assessed health conditions across ten medical scenarios ranging from common colds to life-threatening brain hemorrhages. Researchers from the Oxford Internet Institute and Nuffield Department of Primary Care Health Sciences partnered with MLCommons to evaluate whether large language models including GPT-4o, Llama 3, and Command R+ could help people make better health decisions [1].
The findings challenge the growing trend of relying on AI chatbots for health guidance. Mental Health UK polling from November 2025 found that more than one in three UK residents now use AI to support their mental health or wellbeing [4]. Yet this study suggests such reliance may be misplaced: participants using AI showed no improvement over those using internet search in identifying relevant conditions or recommending appropriate courses of action.

When tested without human participants, the three large language models demonstrated impressive capabilities, identifying conditions correctly in 94.9% of cases and selecting the appropriate course of action in 56.3% of cases [2][5]. However, when real people interacted with these systems, performance collapsed dramatically. Relevant conditions were identified in less than 34.5% of cases, and the correct course of action was given in less than 44.2% of interactions, no better than the control group using traditional resources [5].
Adam Mahdi, associate professor at Oxford and co-author of the paper, described this as a "huge gap" between the potential of AI and the pitfalls when used by people. "The knowledge may be in those bots; however, this knowledge doesn't always translate when interacting with humans," he explained [2]. The human-AI interaction in healthcare proved far more complex than benchmark testing suggested, revealing limitations that controlled experiments failed to capture.

The study identified two critical problems: humans providing incomplete information and AI chatbots generating misleading responses. When researchers analyzed around 30 interactions in detail, they found patients often failed to share complete symptom details, leaving out crucial information [5]. "People share information gradually. They leave things out, they don't mention everything," Mahdi told the BBC [4].
Even more concerning, the systems delivered inaccurate medical advice that could endanger lives. In one documented case, two users described nearly identical symptoms of a subarachnoid hemorrhage, a life-threatening condition causing bleeding on the brain. One patient mentioning the "worst headache ever" was correctly advised to seek emergency care, while another describing a "terrible" headache was told to lie down in a darkened room [2][5]. The models also provided geographically confused guidance, recommending partial US phone numbers alongside "Triple Zero," the Australian emergency number [1].

A separate study published in The Lancet Digital Health adds another layer of concern about AI and medical misinformation. Researchers at Mount Sinai tested 20 large language models and found they were more likely to propagate incorrect medical advice when misinformation came from authoritative-sounding sources [3]. When false information appeared in realistic hospital discharge notes, AI tools believed and passed it along 47% of the time, compared with just 9% for misinformation from social media platforms like Reddit [3].

"Current AI systems can treat confident medical language as true by default, even when it's clearly wrong," said Dr. Eyal Klang of the Icahn School of Medicine at Mount Sinai [3]. User prompts also affected accuracy, with authoritative-sounding questions increasing the likelihood that AI would agree with false information. Overall, the AI models believed fabricated information roughly 32% of the time, though OpenAI's GPT models proved least susceptible while other models accepted up to 63.6% of false claims [3].
The Oxford research highlights a fundamental problem with how AI systems are evaluated for healthcare applications. Models trained on medical textbooks and clinical notes may excel at structured medical licensing exams, but this performance doesn't translate to real-world medical decision-making [1]. "Training AI models on medical textbooks and clinical notes can improve their performance on medical exams, but this is very different from practicing medicine," explained Luc Rocher, associate professor at the Oxford Internet Institute [1].
Doctors spend years developing triage skills using rule-based protocols designed to minimize errors, experience that AI systems lack despite their vast knowledge bases. Lead author Andrew Bean noted that the analysis illustrated how human-AI interaction poses challenges "even for top" AI models [4]. Dr. Rebecca Payne, lead medical practitioner on the study, warned it could be "dangerous" for people to ask chatbots about their symptoms [4].

The researchers concluded that AI chatbots aren't ready for real-world use in helping patients assess health conditions. "Despite strong performance on medical benchmarks, providing people with current generations of LLMs does not appear to improve their understanding of medical information," the study states [1]. Rocher warned that as more people rely on chatbots for medical advice, "we risk flooding already strained hospitals with incorrect but plausible diagnoses" [1].

Dr. Girish Nadkarni, chief AI officer of Mount Sinai Health System, emphasized the need for built-in healthcare safeguards: "AI has the potential to be a real help for clinicians and patients, offering faster insights and support. But it needs built-in safeguards that check medical claims before they are presented as fact" [3]. Dr. Bertalan Meskó, editor of The Medical Futurist, noted that OpenAI and Anthropic recently released health-dedicated versions of their chatbots, which may yield different results, but stressed the need for "clear national regulations, regulatory guardrails and medical guidelines" [4].

The Oxford team plans similar studies across different countries, languages, and time periods to assess whether these factors affect how well AI helps people assess health conditions [2]. For now, the message is clear: AI medical advice requires substantial improvements before it can safely assist the public with healthcare decisions.

Summarized by Navi