16 Sources
[1]
In Harvard study, AI offered more accurate diagnoses than emergency room doctors | TechCrunch
A new study examines how large language models perform in a variety of medical contexts, including real emergency room cases -- where at least one model seemed to be more accurate than human doctors. The study was published this week in Science and comes from a research team led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center. The researchers said they conducted a variety of experiments to measure how OpenAI's models compared to human physicians. In one experiment, researchers focused on 76 patients who came into the Beth Israel emergency room, comparing the diagnoses offered by two attending physicians to those generated by OpenAI's o1 and 4o models. These diagnoses were assessed by two other attending physicians, who did not know which ones came from humans and which came from AI. "At each diagnostic touchpoint, o1 either performed nominally better than or on par with the two attending physicians and 4o," the study said, adding that the differences "were especially pronounced at the first diagnostic touchpoint (initial ER triage), where there is the least information available about the patient and the most urgency to make the correct decision." In Harvard Medical School's press release about the study, the researchers emphasized that they did not "pre-process the data at all" -- the AI models were presented with the same information that was available in the electronic medical records at the time of each diagnosis. With that information, the o1 model managed to offer "the exact or very close diagnosis" in 67% of triage cases, compared to one physician who had the exact or close diagnosis 55% of the time, and to the other who hit the mark 50% of the time. "We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines," said Arjun Manrai, who heads an AI lab at Harvard Medical School and is one of the study's lead authors, in the press release. 
To be clear, the study didn't claim that AI is ready to make real life-or-death decisions in the emergency room. Instead, it said the findings show an "urgent need for prospective trials to evaluate these technologies in real-world patient care settings." The researchers also noted that they only studied how models performed when provided with text-based information, and that "existing studies suggest that current foundation models are more limited in reasoning over nontext inputs." Adam Rodman, a Beth Israel doctor who's also one of the study's lead authors, told the Guardian that there's "no formal framework right now for accountability" around AI diagnoses, and that patients still "want humans to guide them through life or death decisions [and] to guide them through challenging treatment decisions".
[2]
AI can reason like a physician -- what comes next?
Large language models (LLMs) are artificial intelligence (AI) algorithms that are trained on vast amounts of data to learn patterns that enable them to generate human-like responses. Reasoning models are LLMs with the added capability of working through problems step by step before responding, thus mirroring structured thinking. Such AI systems have performed well in assessing medical knowledge, but whether they can match physician-level clinical reasoning on authentic diagnostic tasks remains largely unknown. On page 524 of this issue, Brodeur et al. (1) demonstrate that AI can now seemingly match or exceed physician-level clinical diagnostic reasoning on text-based scenarios by measuring against human physician performances on clinical vignettes and real-world emergency cases. The findings indicate an urgent need to understand how these tools can be safely integrated into clinical workflows, and a readiness for prospective evaluation alongside clinicians. AI has the potential to support a broad range of health care applications, from clinical decisions to medical education and the provision of patient-facing health information. LLMs have passed medical licensing examinations and performed well on structured clinical assessments, raising the prospect that they could help alleviate global health care workforce shortages. However, passing examinations is not the same as being a doctor, and demonstrating physician-level performance on authentic clinical tasks is a fundamentally harder challenge (2). Brodeur et al. evaluated OpenAI's first reasoning model, o1-preview (released in September 2024), across five experiments that assess diagnostic performance on clinical case vignettes against physician and prior-model baselines. A sixth experiment compared o1 with prior models, and physicians across three diagnostic touchpoints on 76 actual emergency department cases. 
Across the experiments, the o1 models substantially outperformed prior-generation nonreasoning LLMs (e.g., GPT-4) and, in many cases, the physicians themselves. For example, when provided with published clinicopathological conference cases, GPT-4 achieved exact or very close diagnostic accuracy in 72.9% of cases, whereas o1-preview achieved this in 88.6% of cases. Further, in actual emergency department cases, o1 achieved 67.1% exact or very-close diagnostic accuracy at initial triage, outperforming two expert attending physicians (55.3% and 50.0%), with blinded reviewers unable to distinguish the AI output from human. This advance sets a new evaluation benchmark -- testing AI against physician performance, and ideally alongside physicians, on authentic clinical tasks. Although the o1 models were limited to text-only input, their reasoning capabilities, deliberation time, and ability to process multimodal inputs have improved substantially in more recent models, expanding the complexity of tasks they can undertake. Notably, reasoning models such as GPT-5.3 and Gemini 3.1 Pro now process text, images, audio, and video together. Brodeur et al. establish a foundation for authentic evaluation across text-based tasks, but clinical practice inherently involves visual and auditory cues, such as findings from physical examinations. Multimodal AI offers the potential for assessments that more closely mirror an actual clinical diagnosis in practice (2). Future work should therefore evaluate the latest models on Brodeur's scenarios, test multimodal capabilities using vignettes that incorporate visual and auditory data, and progress toward prospective clinical assessments. Although the findings of Brodeur et al. 
indicate that AI can seemingly perform diagnostic tasks as well as, if not better than, physicians in specific contexts, the prevailing proposal for AI in health care is not replacement but collaboration, with clinicians providing oversight, contextual judgment, and accountability. That collaborative configuration itself must be tested. In prior work that used clinical vignettes to assess diagnostic and management reasoning, no substantial difference between the performance of physicians augmented with GPT-4 and the GPT-4 model working alone was found, but both outperformed physicians with only conventional resources (3). More broadly, it has been argued that for certain well-defined tasks across health care, AI may operate more effectively independently (4). Indeed, determining the optimal implementation will likely require an evaluation that compares AI alone, clinician alone, and clinician with AI. With clinicians already integrating AI tools into practice, in some cases without institutional oversight (5), the evidence generated by this triad will be essential for determining when AI integration improves care and when it does not. Brodeur et al. focused on diagnostic reasoning tasks, but this is only one domain in which medical AI is being developed. One proposed framework [Medical Holistic Evaluation of Language Models (MedHELM)] uses a taxonomy that includes five domains: clinical decision support, clinical note generation, patient communication, medical research assistance, and administrative workflows (6). Across these areas, AI models are evolving from static question-and-answer tools into agents that can, for example, analyze patient records, monitor clinical encounters through ambient listening, and interact in real time with predictive models built on patient data. 
Whatever the application, the benchmark for use in clinical practice cannot be synthetic performance; it must be improvement in real-world applications, ideally demonstrated through randomized trials. A clinical certification pathway for AI modeled on physician training has also been proposed (7). This pathway progresses AI from medical knowledge assistants to specialty task performance, supervised clinical practice, and, ultimately, broader autonomous scope. The study of Brodeur et al. is a step along this pathway, demonstrating that reasoning models are advancing from knowledge platforms to specialty task performance. The next stage should extend this evaluation to multimodal AI in supervised clinical settings. Although evaluation methods are progressing, the deployment of AI systems is outpacing them. Accuracy on a validated task does not guarantee that a deployed system will confine itself to that task. For example, in January 2026, OpenAI launched ChatGPT Health, a consumer AI tool promoted as a personalized health information source that can address more than 40 million health-related questions submitted each day. The tool was not designed for clinical triage, yet it did not refuse the triage tasks; the first independent evaluation found that it under-triaged more than half of emergencies presented to it (8). The authors of that evaluation rightly argue that independent evaluation of consumer health AI is essential and that there was a lack of clarity as to what ChatGPT Health was for; however, independent evaluations must be rigorous enough to support actionable conclusions. Without physician comparators such as those used by Brodeur et al., it remains unclear whether a clinician, given the same information, would have performed better, limiting what the medical community can recommend. Clear task definitions and transparent human benchmarks equip the medical community to hold developers accountable for performance on defined clinical tasks. 
Accuracy on a defined task is only one dimension of deployment readiness. Clinical AI must also deliver equitable, cost-effective, and safe outcomes, supported by accountability, transparency, and ongoing monitoring. The Journal of the American Medical Association (JAMA) summit on AI in health in 2024 concluded that most health AI efforts still fail to demonstrate real-world effectiveness (an ability to improve outcomes in practice, not just perform well on benchmarks) or equity, calling for multistakeholder engagement, robust measurement tools, data infrastructure that reflects diverse populations, and policy and transparency incentives that drive evaluations targeted at priority issues (9). The risks of not meeting these recommendations are documented, with examples including a widely used health care algorithm that exhibited substantial racial bias affecting equitable health expenditure (10); suboptimal health safeguards of publicly accessible AI tools (11, 12); and biased AI that decreased clinicians' diagnostic accuracy (13). Without robust demonstrated effectiveness, equity, and safety, many AI systems will remain insufficient for clinical use.
[3]
AI Outperforms ER Doctors in Diagnostic Cases, Study Points to Collaborative Care
Have you ever thought about how artificial intelligence compares to a human physician in an emergency diagnostic setting? New research published Thursday might have you thinking over this question. The study, published in the journal Science, found that a state-of-the-art large language model outperformed human doctors on a range of common clinical tasks. Using real emergency department data and hundreds of physician comparisons, the model matched or even exceeded human clinician performance in diagnostic choices, emergency triage and determining next steps in management. The authors of the study said those results do not mean AI models are ready to replace human doctors. Instead, the results indicate that industry professionals need faster, more rigorous standards for evaluation and rules for using AI in medicine. The researchers tested OpenAI's o1 series large language model, released in 2024, across six experiments that blended standardized clinical cases with a real-world sample of randomly selected emergency room patients at a medical center in Massachusetts. The model's advantage was most evident in early-stage triage, when decisions must be made with little information. Both the human clinicians and the AI model improved as more data became available to them, but the study found that the LLM handled uncertainty far better, using fragmented or unstructured health data and notes more effectively. These findings build on decades of using difficult diagnostic cases to evaluate medical-computing systems. Earlier LLMs already outperformed older algorithmic approaches, but what sets this study apart is the scale and the head-to-head comparison between a human doctor and AI in a real clinical scenario. The authors stressed that we should remain skeptical of these results. 
Real clinical work in hospitals and emergency rooms often relies on visual and auditory cues -- rather than text-based reasoning -- which AI cannot interpret fully and accurately. "Future work is needed to assess how humans and machines may effectively collaborate in the use of nontext signals," the study notes. When considering AI-assisted medical care, it's also critical to assess whether it will be safe, equitable and cost-effective, aspects that were not tested in this study. "Long story short, the model outperformed our very large physician baseline. You'll see this in detail, but this included board-certified, actively practicing physicians and real messy cases," Arjun Manrai, an assistant professor of Biomedical Informatics at Harvard Medical School, said during a virtual press briefing call. "I don't think our findings mean that AI replaces doctors, despite what some companies are likely to say, and how they're likely to use these results," Manrai said. "I think it does mean that we're witnessing a really profound change in technology that will reshape medicine, and that we need to evaluate this technology now, and rigorously conduct in prospective clinical trials." Regulators, hospitals and healthcare providers should work together to test these tools thoroughly before they're deployed to ensure safety and equity for all patients. In a commentary also published Thursday in Science, Ashley M. Hopkins and Erik Cornelisse, researchers at Flinders University in Australia, wrote that the study is a step toward better evaluation of AI systems in healthcare, but that medicine is a complex field that requires rigorous oversight to ensure patients receive the best possible care. "We do not allow doctors to practice without supervision and evaluation, and AI should be held to comparable standards," Cornelisse said in a statement.
[4]
Can AI help doctors avoid missed diagnoses? A new study suggests yes
Humans still have important roles to play in medicine, experts stress
In some of medicine's toughest cases, the hardest part isn't choosing the right diagnosis. It's thinking of it at all. Artificial intelligence may now be better at that than doctors, a new study suggests. "We're witnessing a really profound change in technology that will reshape medicine," Harvard University biomedical data scientist Arjun Manrai said in an April 28 news conference. That change is driven by advances in large language models, the same technology OpenAI's ChatGPT is built on. New versions, called reasoning models, can work through complex problems step by step. As of 2025, 1 in 5 doctors and nurses worldwide used AI for a second opinion on complex cases, and over half want to use it for this purpose, according to a survey of more than 2,000 clinicians. But how well the technology works in a medical setting has been debated. Manrai and colleagues tested OpenAI's o1-preview model on a range of medical cases, including classic sets of symptoms used in medical training as well as real-world data directly from the charts of 76 patients who visited an emergency room in Boston. Across those clinical reasoning tests, the AI model was more likely than physicians to include the correct diagnosis, or something very close to it, among its possible answers, the researchers report April 30 in Science. Not all researchers are convinced that this means we should trust AI with our diagnoses, arguing that AI reasoning is still far from what human doctors can do. "When we say clinical reasoning, it doesn't mean the same thing as moral reasoning," says Arya Rao, a researcher at Harvard Medical School, who was not involved in the study. "These models have been optimized to do this kind of sequential thought that we call reasoning, but it's not at all the same thing as how we teach medical students to reason." 
Manrai is not opposed to the critique, noting AI technology should assist rather than replace people in medical roles. "Ultimately, I think humans want humans to guide them ... through challenging treatment decisions," he said. Still, the results show that this type of AI "works for making diagnoses in the real world," coauthor Adam Rodman, a doctor at Beth Israel Deaconess Medical Center in Boston, said at the news conference. He described a patient who came into the emergency room with what seemed like routine respiratory symptoms and had recently undergone an organ transplant and was immunosuppressed. The patient turned out to have a dangerous flesh-eating infection requiring surgery. "The model actually was suspicious of this [infection] from the very beginning, probably 12 to 24 hours before the human physician would have become suspicious of this," Rodman said. Rao applauds the team for presenting [AI] "as an extension of a physician, not a replacement." She calls the study "rigorous and thoughtful." However, she does not think there's enough evidence to say that AI models have aced clinical reasoning. Her team released a study April 13 that tested 21 AI models at each step of the process toward reaching a diagnosis. Reasoning models got the highest scores overall. But when Rao's team drilled down to identify which parts of the diagnostic process were trickiest for AI, the researchers found a weak point that persisted from the oldest models to the newest. That's the process of considering several different uncertain diagnoses. AI models based on LLMs tend to jump to conclusions. "Their reasoning is brittle precisely where uncertainty and nuance matter most," Rao and her team wrote in their paper. Their conclusion was that LLMs are not yet ready to make decisions in medical settings. These two studies evaluated different AI models in different ways. Yet, the results aren't as opposed as they may seem on the surface, both teams say. 
They agree that the next step should be more research. Manrai's team is planning clinical trials to help answer the question: "How do we safely and thoughtfully integrate [AI] into care?" Rao likes that approach. So many people "don't have enough access to care," she says. Someday, she notes, "I think AI can be a great equalizer."
[5]
AI in the emergency department: promising, powerful but still unproven
Artificial intelligence can now outperform doctors at diagnosing patients in the emergency department, according to a new study in Science. The AI was given written notes from real emergency department records from a hospital in Boston, US, and asked to weigh in at different points during the patient's care. At the earliest stage - triage, when a patient first arrives - the AI identified the correct diagnosis, or something closely related, in 67% of cases. The two doctors used for comparison managed 50% and 55%. That's a meaningful gap, especially at the moment when information is scarcest and uncertainty is highest. This study matters because the field is moving so fast. Earlier research showed that large language models - the technology behind systems like ChatGPT - could pass medical licensing exams. Interesting, but not all that illuminating. Passing an exam is not the same as being useful on a ward. This new study goes further. It puts AI alongside doctors across several tasks, using genuine clinical text from a real emergency department. That makes it more directly relevant to medical practice than most of what's come before. It suggests these systems are developing into something that could genuinely help doctors think through a wide range of possible diagnoses, especially in situations where missing a serious condition is the main concern. There are good reasons, though, not to get carried away. The AI was working entirely from written text. It never saw the patient, never noticed how breathless or frightened they looked, never examined them, spoke to their family, weighed up the chaos of a busy department, or took any responsibility for what happened next. It was not practising emergency medicine. It was offering a written opinion based on selected information. There's also a gap between producing a list of possible diagnoses and actually improving patient outcomes. 
A longer list might help a doctor think more broadly, but it could equally generate new problems: unnecessary tests, over-treatment, extra workload, or unwarranted confidence in an answer that sounds plausible but turns out to be wrong. And some of the benchmark cases used in studies like this may have been publicly available when the AI was trained, which doesn't undermine the emergency department findings, but is another reason to treat headline numbers with some scepticism.
The hard question
So the question isn't really whether AI can help doctors think through difficult cases. The harder question is how this should be tested and governed in real clinical settings like the NHS. That question is already urgent. A Royal College of Physicians snapshot found that 16% of UK doctors were using AI tools in clinical practice every day, with another 15% doing so weekly. Doctors are already using these tools in their daily work - before hospitals and health systems have properly worked out how to assess them, train staff to use them safely, spot when they're causing harm, or decide who is responsible when something goes wrong. It's tempting to say that the solution is to keep a human in the loop. But that phrase does very little work on its own. We need to know which human, in which loop, and with what authority. A doctor's ability to override an AI suggestion is not, by itself, a safety system. Someone still has to decide which tools get used, who can change how they behave, how harms are spotted, and who is responsible when the tool quietly starts failing. This study represents genuine progress. But it doesn't, on its own, change how medicine should be practised. The right response is neither to prohibit these systems nor to let them quietly become part of the routine before anyone has thought it through. 
They should be trialled in real clinical settings, used as a form of second-opinion support rather than a substitute for clinical judgment, and measured against what actually matters to patients: care that is better, safer and faster.
[6]
Large language model outperforms human doctors in clinical reasoning tasks
American Association for the Advancement of Science (AAAS), Apr 30 2026
A cutting-edge large language model (LLM) outperformed human doctors in common clinical reasoning tasks including emergency room decisions, identifying likely diagnoses, and choosing next steps in management, according to a new study that used real emergency department data. The authors of the study - one of the largest studies to date to compare artificial intelligence and physicians on a wide array of clinical reasoning tasks - are clear that their results do not mean AI systems are ready to practice medicine on their own, or that doctors can be removed from the diagnostic process. The results do, however, raise urgent questions about the future evaluation and implementation of artificial intelligence (AI) tools in clinical care. For more than 65 years, difficult clinical diagnostic cases have been the gold standard for evaluating medical computing systems. Most recently, LLMs have surpassed earlier computational approaches on these complex cases. However, despite this progress, most medical studies of LLMs have examined narrow or highly controlled scenarios and often lacked direct comparison to the performance of human physicians in real-world clinical reasoning tasks. The rapid advancement of LLM-based medical tools now necessitates more rigorous evaluation. Here, Peter Brodeur and colleagues comprehensively evaluated the diagnostic and treatment-planning abilities of an advanced LLM - the OpenAI o1 series - by comparing its performance to that of hundreds of physicians and earlier AI systems, across a range of clinical reasoning tasks. These included both standardized clinical cases and a real-world study involving randomly selected emergency room patients at a major emergency medical center in Massachusetts. Brodeur et al. found that, across all six experiments, the LLM consistently matched or exceeded human performance in diagnostic and management reasoning. 
Notably, its advantage was most pronounced in early-stage emergency department triage, where clinicians must make rapid decisions with minimal information. While both humans and AI improved as more clinical data became available, the model demonstrated a distinct strength under conditions of uncertainty, using even fragmented, unstructured health record data effectively. According to the authors, LLMs are rapidly approaching, and in some areas surpassing, human-level clinical reasoning, and although AI-assisted decision-making is often viewed as risky, the findings suggest such tools - when used in collaboration with physicians' assessments - could reduce diagnostic errors, delays, and disparities in access to care. However, the authors also note several important limitations of the study. For example, its focus was confined to text-based reasoning, whereas clinical practice depends heavily on visual and auditory cues, areas where current AI remains less capable. "Accuracy on a defined task is only one dimension of deployment readiness. Clinical AI must also deliver equitable, cost-effective, and safe outcomes, supported by accountability, transparency, and ongoing monitoring," write Ashley Hopkins and Erik Cornelisse in a related Perspective. "Without robust demonstrated effectiveness, equity, and safety, many AI systems will remain insufficient for clinical use." Journal reference: DOI: 10.1126/science.adz4433
[7]
AI Just Beat Doctors at Diagnosing ER Patients. Don't Get All Excited
Emergency departments and other clinical settings across the world are now one step closer to sounding like the cockpit of the Millennium Falcon -- with human doctors soliciting advice from, bickering with, and not infrequently trusting the guidance of their opinionated AI colleagues. Researchers at Harvard and Boston's Beth Israel Deaconess Medical Center have successfully tested an advanced large language model (LLM) AI against two attending physicians (humans) in their performance diagnosing incoming emergency room patients at the triage phase. The LLM, OpenAI's first so-called "reasoning" model o1-preview, made the correct call in 67.1% of the 76 actual emergency department cases put to it, with what the researchers called "exact or a very close" diagnostic accuracy in the new study, published today in the journal Science. Two expert physicians sourced from elite university medical institutions, however, only scored 55.3% and 50.0% accuracy, respectively, with blinded physician reviewers unable to tell these o1 and human-made diagnoses apart. The new study also pitted o1 and OpenAI's prior non-reasoning LLMs, like ChatGPT-4, against physicians' past testing baselines diagnosing 143 complex cases published as clinical vignettes in The New England Journal of Medicine. "o1-preview included the correct diagnosis in its differential in 78.3% of these cases," according to one of the study's lead authors, doctoral candidate Thomas Buckley with Harvard Medical School's Department of Biomedical Informatics, who spoke at a press briefing Tuesday. "And when expanding to a differential diagnosis that would have been helpful," Buckley continued, "we found that o1-preview suggested a helpful diagnosis in 97.9% of cases." The results, he noted, not only outperformed ChatGPT-4 but also vastly outpaced a human physician baseline published in Nature, where physicians with the freedom to consult search engines and standard medical resources had an accuracy of 44.5%. 
(Although, this study included a larger and perhaps more thorny set of 302 clinical vignettes.) "I don't think our findings mean that AI replaces doctors," study coauthor Arjun Manrai, who teaches biomedical informatics at Harvard, took pains to emphasize at the press briefing, "despite what some companies are likely to say." Manrai did, however, describe the team's results as evidence of a "really profound change in technology that will reshape medicine," one that would require rigorous testing to verify their utility in actually making patient outcomes better. Two independent medical researchers, who commented on the new study in a piece published concurrently in Science, echoed this view. "The prevailing proposal for AI in health care is not replacement but collaboration," they noted, "with clinicians providing oversight, contextual judgment, and accountability." Study coauthor Adam Rodman, an internal medicine physician at Beth Israel, likened the possible legal status of AI diagnoses to the current paradigm with clinical decision support (CDS), already existing digital tools doctors use while retaining personal culpability for those choices. "I will tell you, as a practicing physician, that would be a limitation to widespread adoption of all of this, if the regulatory system is 'Just trust me,'" Rodman said at the briefing. "I would have to see extraordinarily strong evidence, such as a randomized controlled trial, where I would do that for my patients." Reasoning models, like o1-preview, differ from the AI chatbots you might be used to in that these LLMs have been built to work through problems in structured steps, mirroring more deductive thinking, before delivering answers to a prompt. The system still has its limitations, which, according to the researchers, include real difficulty diagnosing medical cases involving multimodal input, meaning images and audio evidence that would easily help a human doctor diagnose a patient's case. 
"They're underperforming on most medical imaging benchmarks," Buckley said. "I think a really active area of research over the next decade is how do we improve the multimodal integration capabilities of these models." Yujin Potter -- an AI research scientist at the University of California, Berkeley, who reviewed the new study for Gizmodo -- noted that the team's finished paper was quiet on more troubling issues now known to plague AI. Potter, who's not involved with the new research, co-published a study in March detailing how teams of AI can spontaneously develop and act on their own goals when tasked to work in coordination, actively deceiving their human users and exfiltrating files to hide on different servers. "This paper is informative. It's good. But also, this actually means that we also need to understand AI safety better," Potter told Gizmodo. "People should keep in their mind that AI can also hallucinate and give them the wrong information -- and even malicious or misaligned AI can manipulate them." At the Tuesday briefing, Buckley acknowledged that he and his colleagues "didn't formally measure the hallucination rate of these models." "We do know that models such as o1 do hallucinate," Buckley added, "but in the significant majority of cases, we are finding that the model is suggesting something at least helpful, and then in a huge amount of cases, it's suggesting the exact diagnosis in the original case." Manrai, Buckley's coauthor, added: "My mantra is still 'trust, but verify.'"
[8]
In real-world test, an AI model did better than ER doctors at diagnosing patients
A patient shows up at the hospital with a pulmonary embolism -- a blood clot that has traveled to the lungs. After initially improving, their symptoms start to worsen. The medical team suspects the medication isn't working. In steps artificial intelligence -- with its own theory. It has scanned the medical records and suspects a history of lupus, an autoimmune condition which can lead to heart inflammation, could explain what was really ailing the patient. Turns out, the AI model is correct. This type of scenario could become a reality in the not-too-distant future, according to a study published Thursday in the journal Science. Researchers based at Harvard Medical School and Beth Israel Deaconess Medical Center found that an AI reasoning model, developed by OpenAI, excelled at diagnosing patients and making decisions about managing their care. It matched and often outperformed doctors and the earlier AI model, GPT-4. The researchers ran a series of experiments on the AI model to test its clinical acumen -- including actual cases like the lupus patient who'd been previously treated at the emergency department at Beth Israel in Boston. The team graded how well the AI model could provide an accurate diagnosis at three moments in time, from the triage stage in the ER, up to being admitted into the hospital. Overall, AI outperformed two experienced physicians -- and did so with only the electronic health records and the limited information that had been available to the physicians at the time. "This is the big conclusion for me -- it works with the messy real-world data of the emergency department," said Dr. Adam Rodman, a clinical researcher at Beth Israel and one of the study authors. "It works for making diagnoses in the real world."
Other parts of the study relied on tricky case reports published in the New England Journal of Medicine and clinical vignettes to suss out whether the AI model could meet well-established "benchmarks" and game out thorny diagnostic questions. "The model outperformed our very large physician baseline," said Raj Manrai, assistant professor of Biomedical Informatics at Harvard Medical School who was also part of the study. The authors emphasize the research relied on text alone, while in real life, clinicians need to attend to many other inputs like images, sounds and nonverbal cues when diagnosing and treating a patient. Still, the work showcases just how far the technology has advanced in the last few years. Prior generations of large language models faltered when dealing with uncertainty, and in generating a list of possible conditions to check up, what's known as a differential diagnosis. "This paper is a beautiful summary of just how much things have improved," says Dr. David Reich, chief clinical officer for Mount Sinai Health System in New York, who was not involved in the work. "You have something which is quite accurate, possibly ready for prime time," he says. "Now the open question is how the heck do you introduce it into clinical workflows in ways that actually improve care?" After all, arriving at some tricky, final diagnosis -- which the AI model shines at -- isn't necessarily reflective of how things play out "in real clinical medicine," says Reich, where the "outcomes are much more subtle and perhaps more diverse." And the emergency department is only a small portion of the patient's total medical care. Rodman acknowledges it's unlikely AI would have done such an "impressive" job had the team provided it with the records of someone who'd spent a month in the hospital. 
None of those involved in the new study believe the findings support supplanting doctors with AI, "despite what some companies are likely to say and how they're likely to use these results," says Manrai. "I think it does mean that we're witnessing a really profound change in technology that will reshape medicine," he adds. But the results do make the case that AI models need to be tested in a rigorous fashion, ideally through forward-looking trials that can give more certainty about how the technology ultimately impacts clinical practice. "It's a very challenging process to design these trials," says Reich, "but this study is a perfect call to action."
[9]
Medical AI matches doctors on diagnoses, but that doesn't make it safe
Advanced medical artificial intelligence (AI) is improving fast. In difficult diagnosis tests built from real patient cases, some AI systems can now match - and sometimes outperform - experienced doctors. That progress is raising larger questions about safety and patient care. A system that performs well on medical exams or text-based cases may still struggle in real clinical settings, where human judgment and physical symptoms matter. A new study highlights both the promise and the risks of using advanced AI in medicine. Inside text-based case files, the system followed symptoms, weighed diagnoses, and produced answers that physicians judged against expert standards. At Flinders University, Ashley M. Hopkins, Ph.D., argued that strong answers alone do not equal safe medical practice when weighed against real clinical risk. The warning centers on a narrow but crucial boundary: text can hide what patients reveal in person. The next challenge is proving the software helps without putting patients at risk. Separate experiments tested OpenAI o1-preview, a reasoning model that worked through medical problems step by step before answering. On published teaching cases, GPT-4 correctly identified or came very close to the correct diagnosis in nearly three-quarters of cases. OpenAI o1-preview, a newer reasoning model, pushed that figure to nearly nine out of ten cases, showing how quickly medical AI performance has improved. Results like these are difficult for hospitals and doctors to ignore, even though strong test scores do not automatically make software clinically reliable. Across 76 real emergency room cases, the system received patient details in stages, much like doctors do during a fast-moving shift. At the first triage stage - when clinicians decide how urgently someone needs care - o1-preview outperformed two senior physicians in matching the final diagnosis. 
Reviewers also could not reliably tell whether the written reasoning came from a human physician or the software system. For hospitals, that similarity creates both promise and danger because confident language can mask missing bedside details. Real medicine includes a body, a voice, family details, pain behavior, and limitations that are absent from a typed case. Physical examinations change the evidence because touch, breathing sounds, swelling, and movement can confirm or weaken a diagnosis. "Health care decisions are complex, high stakes, and deeply human, and accuracy alone, particularly on just text-based cases, does not make a system safe for patients," said Erik Cornelisse, a Ph.D. candidate at Flinders University College of Medicine and Public Health. Without those checks, a correct-looking answer can still send care in the wrong direction. A diagnosis does not end the work because treatment must account for risks, values, costs, and patient capacity. Clinicians carry responsibility when they choose a test, change a medication, or send someone home with warning signs. Modern care also requires judgment and ethical oversight because someone must weigh tradeoffs when evidence remains incomplete. Support becomes useful only when clinicians understand where the tool is strong, weak, silent, or overconfident. Collaboration sounds simple until a busy doctor accepts a machine's answer without questioning what shaped it. A 2024 trial found that access to GPT-4 did not significantly improve doctors' diagnostic reasoning compared to standard medical resources. In that trial, GPT-4 alone scored higher than doctors using those resources, complicating the idea that software should only serve as an assistant. Such results push health systems to compare clinicians alone, software alone, and clinicians working with software. Poorly tested medical software can cause uneven harm, especially when training data underrepresent certain groups. 
In 2019, one widely used health care algorithm predicted who needed extra care by using past medical spending as a stand-in for illness. That shortcut assigned less help to Black patients because unequal access to care had already lowered recorded spending. Biased models can spread the same mistakes to thousands of patients before anyone notices. Strong averages can hide dangerous failures, and consumer health tools show how easily narrow tasks can move beyond their intended use. A February triage evaluation tested ChatGPT Health, OpenAI's consumer health chatbot, on how urgently people needed care. Among true emergencies, the tool directed 51.6 percent of cases toward delayed 24- to 48-hour evaluation instead of emergency care. That failure matters because patients often seek quick reassurance before symptoms appear severe or easy to classify. Careful deployment could help doctors sort records, compare diagnoses, and catch important details during stressful shifts. The U.S. Food and Drug Administration already regulates medical products and guides developers toward safer medical artificial intelligence design, while good machine learning practice calls for continuous monitoring as models, patients, and hospital workflows evolve. "Patients deserve technology that improves care in the real world, not systems that only look impressive in studies," said Hopkins. A model that reasons through text can strengthen medicine only if testing follows real patient outcomes instead of headline scores. Researchers and hospitals must measure safer decisions, fairer access, usable workflows, and clear accountability before giving these systems greater authority in clinical care.
[10]
AI outperforms doctors in Harvard trial of emergency triage diagnoses
Researchers say results mark a 'profound change in technology that will reshape medicine' From George Clooney in ER to Noah Wyle in The Pitt, emergency department doctors have long been popular heroes. But will it soon be time to hang up the scrubs? A groundbreaking Harvard study has found that AI systems outperformed human doctors in high-pressure emergency medicine triage, diagnosing more accurately in the potentially life and death moments when people are first rushed to hospital. The results were described by independent experts as showing "a genuine step forward" in the clinical reasoning of AIs and came as part of trials that tested the responses of hundreds of doctors against an AI. The authors said the results, published in the journal Science, showed large language models (LLMs) "have eclipsed most benchmarks of clinical reasoning". One experiment focused on 76 patients who arrived at the emergency room of a Boston hospital. An AI and a pair of human doctors were each given the same standard electronic health record to read - typically including vital sign data, demographic information and a few sentences from a nurse about why the patient was there. The AI identified the exact or very close diagnosis in 67% of cases, beating the human doctors, who were right only 50%-55% of the time. It showed the AIs' advantage was particularly pronounced in triage circumstances requiring rapid decisions with minimal information. The diagnosis accuracy of the AI - OpenAI's o1 reasoning model - rose to 82% when more detail was available, compared with the 70-79% accuracy achieved by the expert humans, though this difference was not statistically significant. It also outperformed a larger cohort of human doctors when asked to provide longer term treatment plans, such as providing antibiotics regimes or planning end-of-life processes. 
The AI and 46 doctors were asked to examine five clinical case studies and the computer made significantly better plans, scoring 89% compared with 34% for humans using conventional resources, such as search engines. But it is not curtains for emergency doctors yet, the researchers said. The study only tested humans against AIs looking at patient data that can be communicated via text. The AI's reading of signals, such as the patient's level of distress and their visual appearance, were not tested. That means the AI was performing more like a clinician producing a second opinion based on paperwork. "I don't think our findings mean that AI replaces doctors," said Arjun Manrai, one of the lead authors of the study who heads an AI lab at Harvard Medical School. "I think it does mean that we're witnessing a really profound change in technology that will reshape medicine." Dr Adam Rodman, another lead author and a doctor at Boston's Beth Israel Deaconess medical centre where the study took place, said AI LLMs were among "the most impactful technologies in decades". Over the next decade, he said, AI would not replace physicians but join them in a new "triadic care model ... the doctor, the patient, and an artificial intelligence system". In one case in the Harvard study, a patient presented with a blood clot to the lungs and worsening symptoms. Human doctors thought the anti-coagulants were failing, but the AI noticed something the humans did not: the patient's history of lupus meant this might be causing the inflammation of the lungs. The AI was proved correct. Nearly one in five US physicians are already using AI to assist diagnosis, according to research published last month. In the UK, 16% of doctors are using the tech daily and a further 15% weekly, with "clinical decision-making" being one of the most common uses, according to a recent Royal College of Physicians survey. The UK doctors' biggest concerns were AI error and liability risks. 
Billions are being invested in AI healthcare companies, but questions remain about the consequences of AI error. "There is not a formal framework right now for accountability," said Rodman, who also stressed patients ultimately "want humans to guide them through life or death decisions [and] to guide them through challenging treatment decisions". Prof Ewen Harrison, co-director of the University of Edinburgh's centre for medical informatics, said the study was important and showed that "these systems are no longer just passing medical exams or solving artificial test cases. They are starting to look like useful second-opinion tools for clinicians, particularly when it is important to consider a wider range of possible diagnoses and avoid missing something important." Dr Wei Xing, an assistant professor at the University of Sheffield's school of mathematical and physical sciences, said some of the other findings suggested doctors may unconsciously defer to the AI's answer rather than thinking independently. "This tendency could grow more significant as AI becomes more routinely used in clinical settings," he said. He also highlighted the lack of information about which patients the AI was worse at diagnosing and whether it struggled more with elderly patients or non-English speakers. He said: "It does not demonstrate that AI is safe for routine clinical use, nor that the public should turn to freely available AI tools as a substitute for medical advice."
[11]
Study: AI can outperform doctors on diagnosing cases
Artificial intelligence that can "reason" is now capable of diagnosing real-life medical scenarios as well as or better than physicians, according to the results of a study published Thursday in Science. The researchers used previously unknown clinical cases to test OpenAI's reasoning model o1 against the company's older model, GPT-4, as well as physicians and medical residents in training. In a range of experiments, the o1 model often improved significantly on GPT-4's diagnostic ability and bested physicians, too. When tested with the electronic health records of random emergency department cases from a Boston hospital, the o1 model was diagnostically accurate more than two-thirds of the time at initial triage. Two expert attending physicians had correct diagnoses roughly half of the time. Dr. Robert Wachter, professor and chair of the Department of Medicine at the University of California, San Francisco, described the study's findings as "important" and suggested it's now "indisputable" that modern AI will outperform older large language models and doctors when asked to identify the right diagnosis and next step. He was not involved in the study. However, Wachter, author of "A Giant Leap: How AI is Transforming Healthcare and What That Means for Our Future," added that more research is necessary before AI is fully implemented in clinical practice. "The question is how closely this replicates real life, and the answer is moderately well but not perfectly," Wachter wrote in an email. As the study's authors acknowledge, the experiments were limited to text-only input and didn't include the visual and auditory clues and cues that doctors often rely on for diagnosis. These can include a patient's level of distress and medical imaging.
"GenAI can probably begin to integrate these inputs but for now, a test of a written, and often artificially 'clean' clinical case scenario is not the same as going into an ER and dealing with the chaos," Wachter said. "Just watch The Pitt." Based on their findings, the study's authors highlighted an "urgent" need for further studies and prospective clinical trials to determine how AI systems can improve clinical practice and patient outcomes. "The rapid pace of improvement in LLMs has substantial implications for the science and practice of clinical medicine," wrote the authors, many of whom are based at Boston's Beth Israel Deaconess Medical Center, where the study was conducted. An accompanying article, also published in Science and written by two experts at Flinders Health and Medical Research Institute in Adelaide, Australia, who were not involved in the study, agreed with its urgent implications. They also argued against replacing doctors with AI, instead envisioning a style of collaboration that provides oversight, contextual judgment, and accountability. "Without robust demonstrated effectiveness, equity, and safety, many AI systems will remain insufficient for clinical use," the experts wrote.
[12]
AI nailed emergency diagnoses better than doctors in Harvard trials
AI beat doctors in emergency triage, but don't fire your physician yet. AI has plenty of messy use cases, but emergency medicine may be one place where it can do some real good. A Harvard study comparing AI performance against doctors using patient data from emergency-room cases revealed that OpenAI's o1 reasoning model outperformed human doctors in emergency triage diagnosis, especially in cases where decisions had to be made quickly with limited information. What did the test reveal? A part of the Harvard trial included 76 patients who arrived at the emergency room of a Boston hospital. The AI model and two human doctors were given the same electronic health record, including basic details like vital signs, demographic information, and a short nurse-written note explaining why the patient had come in. The AI managed to identify the exact or near-exact diagnosis in 67% of cases. Meanwhile, the human doctors scored between 50% and 55%. In the second test, more detailed information was provided, which caused the AI's accuracy to rise to 82%. On the other hand, the humans scored between 70% and 79%. It is worth noting that this gap was not statistically significant. Why doctors aren't being replaced yet: The premise of this study revolves around text-based medical reasoning, and not the full reality of emergency care. Researchers note that AI did not assess a patient's distress, appearance, tone, body language, or other real-world signals doctors use in the actual ER. Dr Adam Rodman, another lead author and a doctor at Boston's Beth Israel Deaconess Medical Center, said AI could become part of a "triadic care model" involving the doctor, patient, and AI system. While the results are impressive, the technology isn't ready to be dropped into emergency rooms just yet. Experts raised concerns over accountability, patient safety, AI errors, and whether doctors may start deferring too quickly to AI recommendations.
For now, it is only good enough to offer a second opinion when doctors need one fast.
[13]
A major new study found AI outperformed doctors in ER diagnosis -- but there's a catch
"No one should look at this and say we do not need doctors," Rodman said in a call with reporters. At the same time, the researchers did argue that AI had reached the point where it could be a genuine asset for doctors in certain situations -- especially in the ER, where physicians are frequently dealing with imperfect information. They called for clinical trials that would properly assess the safety and efficacy of using AI for those tasks, serving as a second pair of virtual eyes that could act as a gut check for human physicians, or help them when they encounter a case that is outside their experience or expertise.
[14]
AI outperforms doctors in real ER tests, raising safety questions
In a recent study, an AI system beat physicians across a broad set of medical reasoning tests, including messy emergency room cases drawn from real records. The result pushes medical AI beyond exam success and toward the harder question of whether it can be tested safely in hospitals. In 76 emergency room records, the model faced scattered notes, missing details, and early decisions made before a diagnosis was confirmed. Arjun K. Manrai is an assistant professor who studies medical data at Harvard Medical School (HMS). By comparing the emergency room records with physician answers, Professor Manrai determined where the AI system held its advantage. That advantage lasted even before patients reached the clearer stage of hospital admission. Early uncertainty, not polished textbook cases, became the pressure point that made the result hard to ignore. At triage - the first sorting step in emergency care - the model named an exact or very close diagnosis in 67.1 percent of cases. After an emergency physician gathered more information, the rate rose to 72.4 percent, then reached 81.6 percent at admission. Both attending physicians, doctors who supervise patient care, improved as more facts arrived, but their early scores stayed lower than the AI scores. That gap made the first minutes of care the most telling part of the comparison. Since 1959, written diagnostic cases have helped doctors and computer scientists set medical AI benchmarks, standard tests for comparing systems. Multiple-choice scores then started losing meaning as newer models pushed near the top of old exams. "We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100 percent and we can't track progress anymore because we're already at the ceiling," said Dr. Peter G. Brodeur, one of the study's lead authors. Near-perfect scores forced the researchers to test whether success still held when real charts stayed messy. 
The scores were generated by a large language model - software trained to produce text from patterns in huge datasets. The system came from OpenAI's o1 series, a model family tested on step-by-step medical reasoning. Instead of choosing only one answer, it listed likely diagnoses and suggested the next move in care. That broader task brought the test closer to a doctor's daily work, though it remained limited to written information. Records at Beth Israel Deaconess Medical Center (BIDMC), a Boston teaching hospital, did not get cleaned before the model saw them. Real electronic health records - digital files that store patient care details - often mix old notes, repeated entries, and missing clues. "We didn't pre-process the data at all," said Dr. Adam Rodman, a clinical researcher at BIDMC. Messy inputs matter because small omissions can change which diagnosis looks urgent enough to chase first. Even a correct top diagnosis can send care sideways when the system asks for needless extra tests. Extra scans, blood work, or procedures can create false alarms, delays, cost, and physical risk. "A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm," said Brodeur. Safety therefore depends on the whole recommendation - not only the first name on the diagnosis list. Clinical care runs on more than text, and this test did not measure everything doctors notice. Voices, breathing effort, posture, images, family concerns, and bedside changes can guide decisions before notes catch up. Current foundation models - broad AI systems trained for many tasks - still struggle more when sound and images carry the clues. That boundary keeps the result from becoming an argument for replacing clinicians at the bedside. Human comparison made the team's work stronger because the model did not compete only against older software. 
Hundreds of physicians supplied comparison points across case challenges, management plans, probability estimates, and emergency room second opinions. In the BIDMC real-record test, reviewers were blinded, meaning they did not know whether a diagnosis came from a human or model. That design reduced favoritism, but it could not show whether the tool improves live patient care. Strong benchmark scores now create a practical problem for hospitals, regulators, developers, and patients who need proof. Prospective clinical trials would test whether AI assistance changes patient outcomes during real visits. "We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines," said Manrai. That level of performance makes it necessary to test how the system behaves in real care, where delays, overtesting, missed cues, and false confidence can shape patient outcomes. The message for medicine is not that machines replace doctors, but that written second opinions may soon become testable tools. Safe use will require doctors, engineers, and patients to judge accuracy, harm, speed, cost, and trust in the same breath.
[15]
Landmark Test of Clinical Reasoning Finds AI Outperformed | Newswise
AI can pass the hardest exams medical school has to offer. But can it handle the real world's inherent messiness? Harvard Medical School and Beth Israel Deaconess Medical Center researchers sought to find out. Newswise -- BOSTON - In one of the largest studies to compare artificial intelligence and physicians on a wide array of clinical reasoning tasks including real emergency department data, a team of physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center evaluated whether an AI system could do what physicians do every day: review a messy patient chart and use that information to determine diagnosis and next steps. In a new study published April 30, 2026 in Science, co-senior authors Arjun (Raj) Manrai, assistant professor of biomedical informatics at HMS and Adam Rodman, MD, MPH, a hospitalist and clinical researcher at BIDMC and team report that a large language model (LLM) outperformed physicians across many common clinical reasoning tasks including emergency room decisions, identifying likely diagnoses, and choosing next steps in management. The LLM's performance indicated that long‑standing ways of testing medical AI may no longer capture current systems' performance, pointing to a possible turning point for the field. "We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines," said co-senior author Manrai. "However, this does not mean AI will necessarily improve care -- how and where it should be deployed remain understudied, and we desperately need rigorous prospective trials to evaluate the impact of AI on clinical practice." "Models are increasingly capable," said Peter Brodeur, MD, MA, the study's co‑first author. "We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100 percent and we can't track progress anymore because we're already at the ceiling." 
Incorporating standards first created in the 1950s to train and evaluate doctors, the researchers compared how an AI system performed against hundreds of clinicians. The comparisons included case study diagnostic challenges, reasoning exercises, and real emergency department cases. In one of their experiments, the investigators tasked the LLM with evaluating patients at various points in a standard emergency department setting, ranging from early triage to later admission decisions. At each stage, the model was given only the information available at that point -- drawn directly from real‑world electronic health records -- and asked to generate likely diagnoses and suggest what should happen next. "To better understand real-world performance, we needed to test performance early in the patient course, when clinical data is sparse," said co-first author Thomas Buckley, Harvard Kenneth C. Griffin School of Arts and Science doctoral student, Dunleavy Fellow in HMS' AI in Medicine PhD program, and a member of Manrai's lab. Unlike many prior studies, the team did not smooth out the messiness of real‑world care before testing the AI. The emergency department cases were presented exactly as they appeared in the electronic health record. "We didn't pre‑process the data at all," Rodman said. "The model is literally just processing data as it exists in the health record." At the early decision points in the real-world emergency department cases, the model matched or exceeded attending physicians in diagnostic accuracy. That result surprised even the researchers. "I thought it was going to be a fun experiment but that it wouldn't work that well," Rodman said. "That was not at all what happened." The results make the case that medical AI is ready to be studied the same way as all new medical interventions -- through carefully controlled clinical trials in real care settings. 
The researchers are clear that their results do not suggest that AI systems are ready to practice medicine autonomously, or that physicians can be removed from the diagnostic process. "A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm," said Brodeur. "Humans should be the ultimate baseline when it comes to evaluating performance and safety."
[16]
Harvard Researchers Say Your Next ER Diagnosis May Come From AI -- and It Could Be More Accurate Than Human Doctors
What if the most accurate diagnosis in the emergency room didn't come from a doctor -- but AI? A new study led by a team of researchers at Harvard Medical School suggests that moment may already be here. The study, which was published in Science, a peer-reviewed journal, found that AI systems outperformed doctors in a number of emergency medicine situations. In diagnosing patients, the large language model (LLM) from OpenAI was more accurate than the human experts. This revelation comes as AI is already gaining traction in medicine, with nearly one in five U.S. physicians now using AI to assist with diagnosing patients, according to research published last month by the American Medical Association. "The rapid pace of improvement in LLMs has substantial implications for the science and practice of clinical medicine. Although applying AI to assist with clinical decision support is sometimes viewed as a high-risk endeavor, greater use of these tools might serve to mitigate the human and financial costs of diagnostic error, delay, and lack of access," the study authors wrote.
A Harvard Medical School study published in Science reveals that OpenAI's o1 model achieved 67% diagnostic accuracy at emergency room triage, outperforming two attending physicians who scored 55% and 50%. The research tested AI against doctors using real patient data from Beth Israel Deaconess Medical Center, with blinded reviewers unable to distinguish AI output from human diagnoses. While the findings demonstrate AI's potential as a diagnostic tool, researchers emphasize the urgent need for clinical trials and accountability frameworks before deployment in actual patient care settings.
A groundbreaking study from Harvard Medical School and Beth Israel Deaconess Medical Center demonstrates that AI diagnosis capabilities now match or exceed physician performance in emergency medicine settings. Published in Science, the research tested OpenAI's o1 and 4o models against human doctors using 76 actual emergency room cases, marking a shift from theoretical assessments to authentic clinical evaluation [1].
The results show the o1 model achieved the exact or very close diagnosis in 67% of triage cases, while two attending physicians scored 55% and 50% respectively [1]. This performance gap proved most pronounced at initial triage, the critical moment when information is scarcest and decisions carry the highest urgency. Blinded reviewers assessing the diagnoses could not distinguish between AI-generated and human recommendations [3].
The research team conducted six experiments to measure clinical diagnostic reasoning across multiple scenarios. When tested on published clinicopathological conference cases, the o1-preview model achieved exact or very close diagnostic accuracy in 88.6% of cases, substantially outperforming GPT-4, which scored 72.9% [2]. This advancement in large language model diagnostic accuracy represents a significant leap from earlier AI systems that primarily demonstrated proficiency on medical licensing examinations rather than real-world patient care.
Arjun Manrai, who heads an AI lab at Harvard Medical School and serves as one of the study's lead authors, stated: "We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines" [1]. The researchers emphasized they did not pre-process patient data, presenting the AI with the same information available in electronic medical records at each diagnostic touchpoint [1].
The study found that the AI handled uncertainty far more effectively than human clinicians, particularly when working with fragmented or unstructured health data and notes [3]. Adam Rodman, a Beth Israel doctor and lead author, described a case in which a patient with routine respiratory symptoms, who had recently undergone an organ transplant, turned out to have a dangerous flesh-eating infection. "The model actually was suspicious of this [infection] from the very beginning, probably 12 to 24 hours before the human physician would have become suspicious," Rodman said [4].
Both human clinicians and AI improved as more patient data became available, but the model's advantage at early stages suggests potential for avoiding missed diagnoses with AI support [3]. This capability addresses one of emergency medicine's most challenging aspects: thinking of the correct diagnosis when information is limited and time is critical [4].
Despite the impressive results showing AI outperforms ER doctors in specific contexts, researchers stress the technology should augment rather than replace physicians. "I don't think our findings mean that AI replaces doctors, despite what some companies are likely to say, and how they're likely to use these results," Manrai said during a press briefing [3]. Rodman told the Guardian that patients "want humans to guide them through life or death decisions [and] to guide them through challenging treatment decisions" [1].
Previous research on collaborative care models found no substantial difference between physicians augmented with GPT-4 and the GPT-4 model working alone, though both outperformed physicians with conventional resources [2]. This suggests determining optimal implementation will require evaluating AI-alone, clinician-alone, and clinician-with-AI configurations [2].
The study identifies an urgent need for prospective clinical trials to evaluate AI in healthcare within real-world patient care settings [1]. Currently, "there's no formal framework right now for accountability" around AI diagnoses, according to Rodman [1]. Researchers at Flinders University wrote in a Science commentary that "we do not allow doctors to practice without supervision and evaluation, and AI should be held to comparable standards" [3].
The research carries important limitations. The models only processed text-based information, while actual emergency medicine relies heavily on visual and auditory cues from physical examinations [1]. The AI never saw patients, examined them, spoke to families, or took responsibility for outcomes [5]. Future assessments must evaluate multimodal AI capabilities that process images, audio, and video alongside text [2].
The urgency of establishing safety and equity standards intensifies as AI adoption accelerates. A Royal College of Physicians survey found 16% of UK doctors use AI tools in clinical practice daily, with another 15% using them weekly [5]. Globally, 1 in 5 doctors and nurses used AI for second opinions on complex cases as of 2025, with over half wanting to use it for this purpose [4].
Doctors are integrating these tools into practice, sometimes without institutional oversight, before hospitals have established protocols for assessment, staff training, harm detection, or decision support accountability [2]. Whether producing possible diagnoses actually translates into improved patient outcomes remains unclear, as longer diagnostic lists could generate unnecessary tests, over-treatment, or unwarranted confidence in plausible but incorrect answers [5]. Regulators, hospitals, and healthcare providers must collaborate to test these tools thoroughly, ensuring they deliver care that is better, safer, and faster for all patients [3].
Summarized by Navi