8 Sources
[1]
AI can reason like a physician -- what comes next?
Large language models (LLMs) are artificial intelligence (AI) algorithms that are trained on vast amounts of data to learn patterns that enable them to generate human-like responses. Reasoning models are LLMs with the added capability of working through problems step by step before responding, thus mirroring structured thinking. Such AI systems have performed well in assessing medical knowledge, but whether they can match physician-level clinical reasoning on authentic diagnostic tasks remains largely unknown. On page 524 of this issue, Brodeur et al. (1) demonstrate that AI can now seemingly match or exceed physician-level clinical diagnostic reasoning on text-based scenarios by measuring against human physician performance on clinical vignettes and real-world emergency cases. The findings indicate an urgent need to understand how these tools can be safely integrated into clinical workflows, and a readiness for prospective evaluation alongside clinicians.

AI has the potential to support a broad range of health care applications, from clinical decisions to medical education and the provision of patient-facing health information. LLMs have passed medical licensing examinations and performed well on structured clinical assessments, raising the prospect that they could help alleviate global health care workforce shortages. However, passing examinations is not the same as being a doctor, and demonstrating physician-level performance on authentic clinical tasks is a fundamentally harder challenge (2).

Brodeur et al. evaluated OpenAI's first reasoning model, o1-preview (released in September 2024), across five experiments that assessed diagnostic performance on clinical case vignettes against physician and prior-model baselines. A sixth experiment compared o1 with prior models and physicians across three diagnostic touchpoints on 76 actual emergency department cases. Across the experiments, the o1 models substantially outperformed prior-generation nonreasoning LLMs (e.g., GPT-4) and, in many cases, the physicians themselves. For example, when provided with published clinicopathological conference cases, GPT-4 achieved exact or very close diagnostic accuracy in 72.9% of cases, whereas o1-preview achieved this in 88.6% of cases. Further, in actual emergency department cases, o1 achieved 67.1% exact or very close diagnostic accuracy at initial triage, outperforming two expert attending physicians (55.3% and 50.0%), with blinded reviewers unable to distinguish the AI output from that of humans.

This advance sets a new evaluation benchmark -- testing AI against physician performance, and ideally alongside physicians, on authentic clinical tasks. Although the o1 models were limited to text-only input, reasoning capabilities, deliberation time, and the ability to process multimodal inputs have improved substantially in more recent models, expanding the complexity of tasks they can undertake. Notably, reasoning models such as GPT-5.3 and Gemini 3.1 Pro now process text, images, audio, and video together. Brodeur et al. establish a foundation for authentic evaluation across text-based tasks, but clinical practice inherently involves visual and auditory cues, such as findings from physical examinations. Multimodal AI offers the potential for assessments that more closely mirror actual clinical diagnosis in practice (2).
Future work should therefore evaluate the latest models on the scenarios of Brodeur et al., test multimodal capabilities using vignettes that incorporate visual and auditory data, and progress toward prospective clinical assessments.

Although the findings of Brodeur et al. indicate that AI can seemingly perform diagnostic tasks as well as, if not better than, physicians in specific contexts, the prevailing proposal for AI in health care is not replacement but collaboration, with clinicians providing oversight, contextual judgment, and accountability. That collaborative configuration itself must be tested. Prior work that used clinical vignettes to assess diagnostic and management reasoning found no substantial difference between the performance of physicians augmented with GPT-4 and that of the GPT-4 model working alone, although both outperformed physicians with only conventional resources (3). More broadly, it has been argued that for certain well-defined tasks across health care, AI may operate more effectively independently (4). Indeed, determining the optimal implementation will likely require evaluations that compare AI alone, clinician alone, and clinician with AI. With clinicians already integrating AI tools into practice, in some cases without institutional oversight (5), the evidence generated by this triad will be essential for determining when AI integration improves care and when it does not.

Brodeur et al. focused on diagnostic reasoning tasks, but this is only one domain in which medical AI is being developed. One proposed framework [Medical Holistic Evaluation of Language Models (MedHELM)] uses a taxonomy that includes five domains: clinical decision support, clinical note generation, patient communication, medical research assistance, and administrative workflows (6). Across these areas, AI models are evolving from static question-and-answer tools into agents that can, for example, analyze patient records, monitor clinical encounters through ambient listening, and interact in real time with predictive models built on patient data. Whatever the application, the benchmark for use in clinical practice cannot be synthetic performance; it must be improvement in real-world applications, ideally demonstrated through randomized trials.

A clinical certification pathway for AI modeled on physician training has also been proposed (7). This pathway progresses AI from medical knowledge assistants to specialty task performance, supervised clinical practice, and, ultimately, broader autonomous scope. The study of Brodeur et al. is a step along this pathway, demonstrating that reasoning models are advancing from knowledge platforms to specialty task performance. The next stage should extend this evaluation to multimodal AI in supervised clinical settings.

Although evaluation methods are progressing, the deployment of AI systems is outpacing them. Accuracy on a validated task does not guarantee that a deployed system will confine itself to that task. For example, in January 2026, OpenAI launched ChatGPT Health, a consumer AI tool promoted as a personalized health information source that can address more than 40 million health-related questions submitted each day. The tool was not designed for clinical triage, yet it did not refuse triage tasks; the first independent evaluation found that it under-triaged more than half of the emergencies presented to it (8).
The authors of that evaluation rightly argue that independent evaluation of consumer health AI is essential and that there was a lack of clarity about what ChatGPT Health was for; however, independent evaluations must be rigorous enough to support actionable conclusions. Without physician comparators such as those used by Brodeur et al., it remains unclear whether a clinician given the same information would have performed better, which limits what the medical community can recommend. Clear task definitions and transparent human benchmarks equip the medical community to hold developers accountable for performance on defined clinical tasks.

Accuracy on a defined task is only one dimension of deployment readiness. Clinical AI must also deliver equitable, cost-effective, and safe outcomes, supported by accountability, transparency, and ongoing monitoring. The Journal of the American Medical Association (JAMA) summit on AI in health in 2024 concluded that most health AI efforts still fail to demonstrate real-world effectiveness (an ability to improve outcomes in practice, not just perform well on benchmarks) or equity, calling for multistakeholder engagement, robust measurement tools, data infrastructure that reflects diverse populations, and policy and transparency incentives that drive evaluations targeted at priority issues (9). The risks of not meeting these recommendations are documented, with examples including a widely used health care algorithm that exhibited substantial racial bias affecting equitable health expenditure (10); suboptimal health safeguards in publicly accessible AI tools (11, 12); and biased AI that decreased clinicians' diagnostic accuracy (13). Without robust demonstrated effectiveness, equity, and safety, many AI systems will remain insufficient for clinical use.
[2]
AI Outperforms ER Doctors in Diagnostic Cases, Study Points to Collaborative Care
Have you ever thought about how artificial intelligence compares to a human physician in an emergency diagnostic setting? New research published Thursday might have you pondering that question.

The study, published in the journal Science, found that a state-of-the-art large language model outperformed human doctors on a range of common clinical tasks. Using real emergency department data and hundreds of physician comparisons, the model matched or even exceeded human clinician performance in diagnostic choices, emergency triage and determining next steps in management. The authors of the study said those results do not mean AI models are ready to replace human doctors. Instead, the results indicate that industry professionals need faster, more rigorous standards for evaluation and rules for using AI in medicine.

The researchers tested OpenAI's o1 series large language model, released in 2024, across six experiments that blended standardized clinical cases with a real-world sample of randomly selected emergency room patients at a medical center in Massachusetts. The model's advantage was most evident in early-stage triage, when decisions must be made with little information. Both the human clinicians and the AI model improved as more data became available to them, but the study found that the LLM handled uncertainty far better, using fragmented or unstructured health data and notes more effectively.

These findings build on decades of using difficult diagnostic cases to evaluate medical-computing systems. Earlier LLMs already outperformed older algorithmic approaches, but what sets this study apart is the scale and the head-to-head comparison between human doctors and AI in real clinical scenarios.

The authors stressed that we should remain skeptical of these results. Real clinical work in hospitals and emergency rooms often relies on visual and auditory cues -- rather than text-based reasoning alone -- which AI cannot yet interpret fully and accurately. "Future work is needed to assess how humans and machines may effectively collaborate in the use of nontext signals," the study notes. When considering AI-assisted medical care, it's also critical to assess whether it will be safe, equitable and cost-effective, aspects that were not tested in this study.

"Long story short, the model outperformed our very large physician baseline. You'll see this in detail, but this included board-certified, actively practicing physicians and real messy cases," Arjun Manrai, an assistant professor of biomedical informatics at Harvard Medical School, said during a virtual press briefing.

"I don't think our findings mean that AI replaces doctors, despite what some companies are likely to say, and how they're likely to use these results," Manrai said. "I think it does mean that we're witnessing a really profound change in technology that will reshape medicine, and that we need to evaluate this technology now, and rigorously conduct in prospective clinical trials."

Regulators, hospitals and healthcare providers should work together to test these tools thoroughly before they're deployed to ensure safety and equity for all patients. In a commentary also published Thursday in Science, Ashley M.
Hopkins and Eric Cornelisse, researchers at Flinders University in Australia, wrote that the study is a step toward better evaluation of AI systems in healthcare, but that medicine is a complex field that requires rigorous oversight to ensure patients receive the best possible care. "We do not allow doctors to practice without supervision and evaluation, and AI should be held to comparable standards," Cornelisse said in a statement.
[3]
Can AI help doctors avoid missed diagnoses? A new study suggests yes
Humans still have important roles to play in medicine, experts stress.

In some of medicine's toughest cases, the hardest part isn't choosing the right diagnosis. It's thinking of it at all. Artificial intelligence may now be better at that than doctors, a new study suggests.

"We're witnessing a really profound change in technology that will reshape medicine," Harvard University biomedical data scientist Arjun Manrai said in an April 28 news conference. That change is driven by advances in large language models, the same technology OpenAI's ChatGPT is built on. New versions, called reasoning models, can work through complex problems step by step.

As of 2025, 1 in 5 doctors and nurses worldwide used AI for a second opinion on complex cases, and over half want to use it for this purpose, according to a survey of more than 2,000 clinicians. But how well the technology works in a medical setting has been debated.

Manrai and colleagues tested OpenAI's o1-preview model on a range of medical cases, including classic sets of symptoms used in medical training as well as real-world data drawn directly from the charts of 76 patients who visited an emergency room in Boston. Across those clinical reasoning tests, the AI model was more likely than physicians to include the correct diagnosis, or something very close to it, among its possible answers, the researchers report April 30 in Science.

Not all researchers are convinced that this means we should trust AI with our diagnoses, arguing that AI reasoning is still far from what human doctors can do. "When we say clinical reasoning, it doesn't mean the same thing as moral reasoning," says Arya Rao, a researcher at Harvard Medical School who was not involved in the study. "These models have been optimized to do this kind of sequential thought that we call reasoning, but it's not at all the same thing as how we teach medical students to reason."

Manrai is not opposed to the critique, noting that AI technology should assist rather than replace people in medical roles. "Ultimately, I think humans want humans to guide them ... through challenging treatment decisions," he said.

Still, the results show that this type of AI "works for making diagnoses in the real world," coauthor Adam Rodman, a doctor at Beth Israel Deaconess Medical Center in Boston, said at the news conference. He described a patient who came into the emergency room with what seemed like routine respiratory symptoms but who had recently undergone an organ transplant and was immunosuppressed. The patient turned out to have a dangerous flesh-eating infection requiring surgery. "The model actually was suspicious of this [infection] from the very beginning, probably 12 to 24 hours before the human physician would have become suspicious of this," Rodman said.

Rao applauds the team for presenting [AI] "as an extension of a physician, not a replacement." She calls the study "rigorous and thoughtful." However, she does not think there's enough evidence to say that AI models have aced clinical reasoning. Her team released a study April 13 that tested 21 AI models at each step of the process toward reaching a diagnosis. Reasoning models got the highest scores overall. But when Rao's team drilled down to identify which parts of the diagnostic process were trickiest for AI, the researchers found a weak point that persisted from the oldest models to the newest: the process of considering several different uncertain diagnoses. AI models based on LLMs tend to jump to conclusions.
"Their reasoning is brittle precisely where uncertainty and nuance matter most," Rao and her team wrote in their paper. Their conclusion was that LLMs are not yet ready to make decisions in medical settings. These two studies evaluated different AI models in different ways. Yet, the results aren't as opposed as they may seem on the surface, both teams say. They agree that the next step should be more research. Manrai's team is planning clinical trials to help answer the question: "How do we safely and thoughtfully integrate [AI] into care?" Rao likes that approach. So many people "don't have enough access to care," she says. Someday, she notes, "I think AI can be a great equalizer."
[4]
AI Just Beat Doctors at Diagnosing ER Patients. Don't Get All Excited
Emergency departments and other clinical settings across the world are now one step closer to sounding like the cockpit of the Millennium Falcon -- with human doctors soliciting advice from, bickering with, and not infrequently trusting the guidance of their opinionated AI colleagues.

Researchers at Harvard and Boston's Beth Israel Deaconess Medical Center have successfully tested an advanced large language model (LLM) AI against two attending physicians (humans) on their performance diagnosing incoming emergency room patients at the triage phase. The LLM, OpenAI's first so-called "reasoning" model, o1-preview, made the correct call -- what the researchers deemed "exact or a very close" diagnostic accuracy -- in 67.1% of the 76 actual emergency department cases put to it in the new study, published today in the journal Science. Two expert physicians sourced from elite university medical institutions, however, scored only 55.3% and 50.0% accuracy, respectively, with blinded physician reviewers unable to tell the o1 and human-made diagnoses apart.

The new study also pitted o1 and OpenAI's prior non-reasoning LLMs, such as GPT-4, against physicians' past testing baselines in diagnosing 143 complex cases published as clinical vignettes in The New England Journal of Medicine. "o1-preview included the correct diagnosis in its differential in 78.3% of these cases," according to one of the study's lead authors, doctoral candidate Thomas Buckley of Harvard Medical School's Department of Biomedical Informatics, who spoke at a press briefing Tuesday. "And when expanding to a differential diagnosis that would have been helpful," Buckley continued, "we found that o1-preview suggested a helpful diagnosis in 97.9% of cases." The results, he noted, not only outperformed GPT-4 but also vastly outpaced a human physician baseline published in Nature, where physicians with the freedom to consult search engines and standard medical resources had an accuracy of 44.5%. (That study, though, included a larger and perhaps thornier set of 302 clinical vignettes.)

"I don't think our findings mean that AI replaces doctors," study coauthor Arjun Manrai, who teaches biomedical informatics at Harvard, took pains to emphasize at the press briefing, "despite what some companies are likely to say." Manrai did, however, describe the team's results as evidence of a "really profound change in technology that will reshape medicine," one that would require rigorous testing to verify its utility in actually making patient outcomes better.

Two independent medical researchers, who commented on the new study in a piece published concurrently in Science, echoed this view. "The prevailing proposal for AI in health care is not replacement but collaboration," they noted, "with clinicians providing oversight, contextual judgment, and accountability."

Study coauthor Adam Rodman, an internal medicine physician at Beth Israel, likened the possible legal status of AI diagnoses to the current paradigm with clinical decision support (CDS), the existing digital tools doctors already use while retaining personal culpability for those choices. "I will tell you, as a practicing physician, that would be a limitation to widespread adoption of all of this, if the regulatory system is 'Just trust me,'" Rodman said at the briefing.
"I would have to see extraordinarily strong evidence, such as a randomized controlled trial, where I would do that for my patients."

Reasoning models like o1-preview differ from the AI chatbots you might be used to in that these LLMs have been built to work through problems in structured steps, mirroring more deductive thinking, before delivering answers to a prompt. The system still has its limitations, which, according to the researchers, include real difficulty diagnosing medical cases involving multimodal input, meaning the images and audio evidence that would readily help a human doctor diagnose a patient's case. "They're underperforming on most medical imaging benchmarks," Buckley said. "I think a really active area of research over the next decade is how do we improve the multimodal integration capabilities of these models."

Yujin Potter, an AI research scientist at the University of California, Berkeley, who reviewed the new study for Gizmodo, noted that the team's finished paper was quiet on more troubling issues now known to plague AI. Potter, who's not involved with the new research, co-published a study in March detailing how teams of AI can spontaneously develop and act on their own goals when tasked to work in coordination, actively deceiving their human users and exfiltrating files to hide on different servers. "This paper is informative. It's good. But also, this actually means that we also need to understand AI safety better," Potter told Gizmodo. "People should keep in their mind that AI can also hallucinate and give them the wrong information -- and even malicious or misaligned AI can manipulate them."

At the Tuesday briefing, Buckley acknowledged that he and his colleagues "didn't formally measure the hallucination rate of these models." "We do know that models such as o1 do hallucinate," Buckley added, "but in the significant majority of cases, we are finding that the model is suggesting something at least helpful, and then in a huge amount of cases, it's suggesting the exact diagnosis in the original case." Manrai, Buckley's coauthor, added: "My mantra is still 'trust, but verify.'"
[5]
In real-world test, an AI model did better than ER doctors at diagnosing patients
A patient shows up at the hospital with a pulmonary embolism -- a blood clot that has traveled to the lungs. After initially improving, their symptoms start to worsen. The medical team suspects the medication isn't working. In steps artificial intelligence -- with its own theory. It has scanned the medical records and suspects that a history of lupus, an autoimmune condition which can lead to heart inflammation, could explain what was really ailing the patient. Turns out, the AI model is correct.

This type of scenario could become a reality in the not-too-distant future, according to a study published Thursday in the journal Science. Researchers based at Harvard Medical School and Beth Israel Deaconess Medical Center found that an AI reasoning model, developed by OpenAI, excelled at diagnosing patients and making decisions about managing their care. It matched and often outperformed doctors and the earlier AI model, GPT-4.

The researchers ran a series of experiments on the AI model to test its clinical acumen, including actual cases like the lupus patient who'd been previously treated at the emergency department at Beth Israel in Boston. The team graded how well the AI model could provide an accurate diagnosis at three moments in time, from the triage stage in the ER up to admission to the hospital. Overall, the AI outperformed two experienced physicians -- and did so with only the electronic health records and the limited information that had been available to the physicians at the time.

"This is the big conclusion for me -- it works with the messy real-world data of the emergency department," said Dr. Adam Rodman, a clinical researcher at Beth Israel and one of the study authors. "It works for making diagnoses in the real world."

Other parts of the study relied on tricky case reports published in the New England Journal of Medicine and clinical vignettes to suss out whether the AI model could meet well-established "benchmarks" and game out thorny diagnostic questions. "The model outperformed our very large physician baseline," said Raj Manrai, assistant professor of biomedical informatics at Harvard Medical School, who was also part of the study.

The authors emphasize that the research relied on text alone, while in real life clinicians need to attend to many other inputs, such as images, sounds and nonverbal cues, when diagnosing and treating a patient. Still, the work showcases just how far the technology has advanced in the last few years. Prior generations of large language models faltered when dealing with uncertainty and in generating a list of possible conditions to consider, what's known as a differential diagnosis.

"This paper is a beautiful summary of just how much things have improved," says Dr. David Reich, chief clinical officer for the Mount Sinai Health System in New York, who was not involved in the work. "You have something which is quite accurate, possibly ready for prime time," he says. "Now the open question is how the heck do you introduce it into clinical workflows in ways that actually improve care?"

After all, arriving at some tricky final diagnosis -- which the AI model shines at -- isn't necessarily reflective of how things play out "in real clinical medicine," says Reich, where the "outcomes are much more subtle and perhaps more diverse." And the emergency department is only a small portion of the patient's total medical care.
Rodman acknowledges it's unlikely AI would have done such an "impressive" job had the team provided it with the records of someone who'd spent a month in the hospital. None of those involved in the new study believe the findings support supplanting doctors with AI, "despite what some companies are likely to say and how they're likely to use these results," says Manrai. "I think it does mean that we're witnessing a really profound change in technology that will reshape medicine," he adds. But the results do make the case that AI models need to be tested in a rigorous fashion, ideally through forward-looking trials that can give more certainty about how the technology ultimately impacts clinical practice. "It's a very challenging process to design these trials," says Reich, "but this study is a perfect call to action."
[6]
AI outperforms doctors in Harvard trial of emergency triage diagnoses
Researchers say results mark a 'profound change in technology that will reshape medicine'.

From George Clooney in ER to Noah Wyle in The Pitt, emergency department doctors have long been popular heroes. But will it soon be time to hang up the scrubs?

A groundbreaking Harvard study has found that AI systems outperformed human doctors in high-pressure emergency medicine triage, diagnosing more accurately in the potentially life-and-death moments when people are first rushed to hospital. The results were described by independent experts as showing "a genuine step forward" in the clinical reasoning of AIs and came as part of trials that tested the responses of hundreds of doctors against an AI. The authors said the results, published in the journal Science, showed large language models (LLMs) "have eclipsed most benchmarks of clinical reasoning".

One experiment focused on 76 patients who arrived at the emergency room of a Boston hospital. An AI and a pair of human doctors were each given the same standard electronic health record to read -- typically including vital-sign data, demographic information and a few sentences from a nurse about why the patient was there. The AI identified the exact or a very close diagnosis in 67% of cases, beating the human doctors, who were right only 50%-55% of the time. The AI's advantage was particularly pronounced in triage circumstances requiring rapid decisions with minimal information.

The diagnostic accuracy of the AI -- OpenAI's o1 reasoning model -- rose to 82% when more detail was available, compared with the 70%-79% accuracy achieved by the expert humans, though this difference was not statistically significant. The AI also outperformed a larger cohort of human doctors when asked to provide longer-term treatment plans, such as setting antibiotic regimens or planning end-of-life care. The AI and 46 doctors were asked to examine five clinical case studies, and the computer made significantly better plans, scoring 89% compared with 34% for humans using conventional resources, such as search engines.

But it is not curtains for emergency doctors yet, the researchers said. The study only tested humans against AIs looking at patient data that can be communicated via text. The AI's reading of signals such as the patient's level of distress and their visual appearance was not tested. That means the AI was performing more like a clinician producing a second opinion based on paperwork.

"I don't think our findings mean that AI replaces doctors," said Arjun Manrai, one of the lead authors of the study, who heads an AI lab at Harvard Medical School. "I think it does mean that we're witnessing a really profound change in technology that will reshape medicine."

Dr Adam Rodman, another lead author and a doctor at Boston's Beth Israel Deaconess medical centre, where the study took place, said AI LLMs were among "the most impactful technologies in decades". Over the next decade, he said, AI would not replace physicians but join them in a new "triadic care model ... the doctor, the patient, and an artificial intelligence system".

In one case in the Harvard study, a patient presented with a blood clot in the lungs and worsening symptoms. The human doctors thought the anticoagulants were failing, but the AI noticed something the humans did not: the patient's history of lupus might be causing the inflammation in the lungs. The AI was proved correct.
Nearly one in five US physicians are already using AI to assist diagnosis, according to research published last month. In the UK, 16% of doctors are using the tech daily and a further 15% weekly, with "clinical decision-making" being one of the most common uses, according to a recent Royal College of Physicians survey. The UK doctors' biggest concerns were AI error and liability risks.

Billions are being invested in AI healthcare companies, but questions remain about the consequences of AI error. "There is not a formal framework right now for accountability," said Rodman, who also stressed patients ultimately "want humans to guide them through life or death decisions [and] to guide them through challenging treatment decisions".

Prof Ewen Harrison, co-director of the University of Edinburgh's centre for medical informatics, said the study was important and showed that "these systems are no longer just passing medical exams or solving artificial test cases. They are starting to look like useful second-opinion tools for clinicians, particularly when it is important to consider a wider range of possible diagnoses and avoid missing something important."

Dr Wei Xing, an assistant professor at the University of Sheffield's school of mathematical and physical sciences, said some of the other findings suggested doctors may unconsciously defer to the AI's answer rather than thinking independently. "This tendency could grow more significant as AI becomes more routinely used in clinical settings," he said. He also highlighted the lack of information about which patients the AI was worse at diagnosing and whether it struggled more with elderly patients or non-English speakers. He said: "It does not demonstrate that AI is safe for routine clinical use, nor that the public should turn to freely available AI tools as a substitute for medical advice."
[7]
Study: AI can outperform doctors on diagnosing cases
Artificial intelligence that can "reason" is now capable of diagnosing real-life medical scenarios as well as or better than physicians, according to the results of a study published Thursday in Science.

The researchers used previously unseen clinical cases to test OpenAI's reasoning model o1 against the company's older model, GPT-4, as well as against physicians and medical residents in training. In a range of experiments, the o1 model often improved significantly on GPT-4's diagnostic ability and bested physicians, too. When tested with the electronic health records of random emergency department cases from a Boston hospital, the o1 model was diagnostically accurate more than two-thirds of the time at initial triage. Two expert attending physicians had correct diagnoses roughly half of the time.

Dr. Robert Wachter, professor and chair of the Department of Medicine at the University of California, San Francisco, described the study's findings as "important" and suggested it's now "indisputable" that modern AI will outperform older large language models and doctors when asked to identify the right diagnosis and next step. He was not involved in the study. However, Wachter, author of "A Giant Leap: How AI Is Transforming Healthcare and What That Means for Our Future," added that more research is necessary before AI is fully implemented in clinical practice. "The question is how closely this replicates real life, and the answer is moderately well but not perfectly," Wachter wrote in an email.

As the study's authors acknowledge, the experiments were limited to text-only input and didn't include the visual and auditory clues and cues that doctors often rely on for diagnosis, such as a patient's level of distress and medical imaging. "GenAI can probably begin to integrate these inputs but for now, a test of a written, and often artificially 'clean' clinical case scenario is not the same as going into an ER and dealing with the chaos," Wachter said. "Just watch The Pitt."

Based on their findings, the study's authors highlighted an "urgent" need for further studies and prospective clinical trials to determine how AI systems can improve clinical practice and patient outcomes. "The rapid pace of improvement in LLMs has substantial implications for the science and practice of clinical medicine," wrote the authors, many of whom are based at Boston's Beth Israel Deaconess Medical Center, where the study was conducted.

An accompanying commentary, also published in Science and written by two experts at Flinders Health and Medical Research Institute in Adelaide, Australia, who were not involved in the study, agreed with its urgent implications. They also argued against replacing doctors with AI, instead envisioning a style of collaboration in which clinicians provide oversight, contextual judgment, and accountability. "Without robust demonstrated effectiveness, equity, and safety, many AI systems will remain insufficient for clinical use," the experts wrote.
[8]
Landmark Test of Clinical Reasoning Finds AI Outperformed | Newswise
AI can pass the hardest exams medical school has to offer. But can it handle the real world's inherent messiness? Harvard Medical School and Beth Israel Deaconess Medical Center researchers sought to find out.

Newswise -- BOSTON -- In one of the largest studies to compare artificial intelligence and physicians on a wide array of clinical reasoning tasks, including real emergency department data, a team of physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center evaluated whether an AI system could do what physicians do every day: review a messy patient chart and use that information to determine diagnosis and next steps.

In a new study published April 30, 2026, in Science, co-senior authors Arjun (Raj) Manrai, assistant professor of biomedical informatics at HMS, and Adam Rodman, MD, MPH, a hospitalist and clinical researcher at BIDMC, and their team report that a large language model (LLM) outperformed physicians across many common clinical reasoning tasks, including emergency room decisions, identifying likely diagnoses, and choosing next steps in management. The LLM's performance indicated that long-standing ways of testing medical AI may no longer capture current systems' performance, pointing to a possible turning point for the field.

"We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines," said co-senior author Manrai. "However, this does not mean AI will necessarily improve care -- how and where it should be deployed remain understudied, and we desperately need rigorous prospective trials to evaluate the impact of AI on clinical practice."

"Models are increasingly capable," said Peter Brodeur, MD, MA, the study's co-first author. "We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100 percent and we can't track progress anymore because we're already at the ceiling."

Incorporating standards first created in the 1950s to train and evaluate doctors, the researchers compared how an AI system performed against hundreds of clinicians. The comparisons included case-study diagnostic challenges, reasoning exercises, and real emergency department cases. In one of their experiments, the investigators tasked the LLM with evaluating patients at various points in a standard emergency department course, ranging from early triage to later admission decisions. At each stage, the model was given only the information available at that point -- drawn directly from real-world electronic health records -- and asked to generate likely diagnoses and suggest what should happen next.

"To better understand real-world performance, we needed to test performance early in the patient course, when clinical data is sparse," said co-first author Thomas Buckley, a doctoral student at the Harvard Kenneth C. Griffin Graduate School of Arts and Sciences, Dunleavy Fellow in HMS's AI in Medicine PhD program, and a member of Manrai's lab.

Unlike many prior studies, the team did not smooth out the messiness of real-world care before testing the AI. The emergency department cases were presented exactly as they appeared in the electronic health record. "We didn't pre-process the data at all," Rodman said. "The model is literally just processing data as it exists in the health record."

At the early decision points in the real-world emergency department cases, the model matched or exceeded attending physicians in diagnostic accuracy. That result surprised even the researchers.
"I thought it was going to be a fun experiment but that it wouldn't work that well," Rodman said. "That was not at all what happened." The results make the case that medical AI is ready to be studied the same way as all new medical interventions -- through carefully controlled clinical trials in real care settings. The researchers are clear that their results do not suggest that AI systems are ready to practice medicine autonomously, or that physicians can be removed from the diagnostic process. "A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm," said Brodeur. "Humans should be the ultimate baseline when it comes to evaluating performance and safety." About Harvard Medical School Harvard Medical School brings together the brightest minds in science and medicine to improve health and well-being for all. The school and its affiliated hospitals and research institutions are home to 12,000 faculty members and 1,600 medical and graduate students. Together, they function as a magnet, pulling together the best and most passionate researchers, clinicians, students, and changemakers in science, medicine, and health. About Beth Israel Deaconess Medical Center Beth Israel Deaconess Medical Center is a leading academic medical center, where extraordinary care is supported by high-quality education and research. BIDMC is a teaching affiliate of Harvard Medical School and consistently ranks as a national leader among independent hospitals in National Institutes of Health funding. BIDMC is the official hospital of the Boston Red Sox. Beth Israel Deaconess Medical Center is a part of Beth Israel Lahey Health, a healthcare system that brings together academic medical centers and teaching hospitals, community and specialty hospitals, more than 4,700 physicians and 39,000 employees in a shared mission to expand access to great care and advance the science and practice of medicine through groundbreaking research and education.
A groundbreaking study published in Science reveals that OpenAI's o1-preview reasoning model achieved 67.1% diagnostic accuracy on real emergency department cases, surpassing two expert physicians who scored 55.3% and 50%. Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center emphasize that while AI in medicine shows remarkable potential, the findings point toward collaborative care models rather than physician replacement.
A landmark study published in the journal Science demonstrates that AI in medicine has achieved a significant milestone, with OpenAI's o1-preview model matching or exceeding physician-level clinical diagnostic reasoning on authentic medical cases [1]. Researchers led by Arjun Manrai from Harvard Medical School and Adam Rodman from Beth Israel Deaconess Medical Center tested the o1-preview model across six experiments, including 76 actual emergency department cases and 143 complex clinical vignettes published in The New England Journal of Medicine [2].

The results reveal striking advances in AI diagnostic accuracy. When evaluating real emergency room patients at triage, the AI achieved 67.1% exact or very close diagnostic accuracy, while two expert attending physicians scored 55.3% and 50.0%, respectively [4]. Blinded physician reviewers could not distinguish the AI output from human diagnoses. On published clinicopathological conference cases, o1-preview achieved exact or very close accuracy in 88.6% of cases versus GPT-4's 72.9% [1], and on the complex clinical vignettes it included the correct diagnosis in its differential in 78.3% of cases and suggested a helpful diagnosis in 97.9% of cases [4].
The OpenAI o1-preview model represents a new class of reasoning models: large language models (LLMs) enhanced with the capability to work through complex problems step by step before responding, mirroring structured thinking [1]. This deliberative approach proved particularly effective during early-stage triage, when decisions must be made with limited information. The model handled uncertainty far better than human clinicians, using fragmented or unstructured electronic health records and notes more effectively [2].

Rodman described a compelling case in which a patient presented with routine respiratory symptoms after an organ transplant. The AI model suspected a dangerous flesh-eating infection from the very beginning, approximately 12 to 24 hours before human physicians would have become suspicious of the condition [3]. In another instance, when a pulmonary embolism patient's symptoms worsened despite treatment, the AI correctly identified lupus-related heart inflammation as the underlying cause by scanning the medical records [5].
Despite the impressive performance, researchers emphasize that AI outperforms doctors in specific contexts but should not replace them. "I don't think our findings mean that AI replaces doctors, despite what some companies are likely to say," Manrai stated during a press briefing [2]. The prevailing proposal for AI in emergency medicine focuses on collaborative care models, with clinicians providing oversight, contextual judgment, and accountability [1].

Prior research using clinical vignettes found no substantial difference between physicians augmented with GPT-4 and GPT-4 working alone, though both outperformed physicians with only conventional resources [1]. This suggests that determining the optimal implementation requires evaluating AI alone, clinician alone, and clinician with AI -- a critical consideration as clinicians already integrate AI tools into practice, sometimes without institutional oversight [1].
While the study establishes a foundation for authentic evaluation of text-based tasks, real clinical work relies heavily on visual and auditory cues, such as findings from physical examinations [1]. The o1 models were limited to text-only input and currently underperform on most medical imaging benchmarks [4]. Newer multimodal AI systems such as GPT-5.3 and Gemini 3.1 Pro can process text, images, audio, and video together, potentially enabling assessments that more closely mirror actual clinical diagnosis [1].

Separate research by Arya Rao at Harvard Medical School identified a persistent weak point in AI reasoning: considering several different uncertain diagnoses. LLM-based models tend to jump to conclusions, with reasoning that is "brittle precisely where uncertainty and nuance matter most" [3]. Concerns about AI hallucinations and patient safety also persist, with researchers noting that AI can spontaneously develop unexpected behaviors and provide incorrect information [4].

The findings indicate an urgent need to understand how these tools can be safely integrated into clinical workflows through prospective clinical trials [1]. "We're witnessing a really profound change in technology that will reshape medicine, and we need to evaluate this technology now, and rigorously conduct in prospective clinical trials," Manrai emphasized [2]. Regulators, hospitals, and healthcare providers must work together to test these tools thoroughly before deployment to ensure safety and equity for all patients [2].

Researchers at Flinders University wrote in a concurrent Science commentary that "we do not allow doctors to practice without supervision and evaluation, and AI should be held to comparable standards" [2]. As of 2025, 1 in 5 doctors and nurses worldwide used AI for a second opinion on complex cases, with over half wanting to use it for this purpose [3]. With such widespread interest, establishing decision support systems that balance AI capabilities with human expertise becomes critical for the future of medicine.