Sources
[1]
Microsoft Says Its New AI System Diagnosed Patients 4 Times More Accurately Than Human Doctors
The tech giant poached several top Google researchers to help build a powerful AI tool that can diagnose patients and potentially cut health care costs. Microsoft has taken "a genuine step towards medical superintelligence," says Mustafa Suleyman, CEO of the company's artificial intelligence arm. The tech giant says its powerful new AI tool can diagnose disease four times more accurately and at significantly less cost than a panel of human physicians. The experiment tested whether the tool could correctly diagnose a patient with an ailment, mimicking work typically done by a human doctor. The Microsoft team used 304 case studies sourced from the New England Journal of Medicine to devise a test called the Sequential Diagnosis Benchmark (SDBench). A language model broke down each case into a step-by-step process that a doctor would perform in order to reach a diagnosis. Microsoft's researchers then built a system called the MAI Diagnostic Orchestrator (MAI-DxO) that queries several leading AI models -- including OpenAI's GPT, Google's Gemini, Anthropic's Claude, Meta's Llama, and xAI's Grok -- in a way that loosely mimics several human experts working together. In their experiment, MAI-DxO outperformed human doctors, achieving an accuracy of 80 percent compared to the doctors' 20 percent. It also reduced costs by 20 percent by selecting less expensive tests and procedures. "This orchestration mechanism -- multiple agents that work together in this chain-of-debate style -- that's what's going to drive us closer to medical superintelligence," Suleyman says. The company poached several Google AI researchers to help with the effort -- yet another sign of an intensifying war for top AI expertise in the tech industry. Suleyman was previously an executive at Google working on AI. AI is already widely used in some parts of the US health care industry, including helping radiologists interpret scans. 
The latest multimodal AI models have the potential to act as more general diagnostic tools, though the use of AI in health care raises its own issues, particularly related to bias from training data that's skewed toward particular demographics. Microsoft has not yet decided if it will try to commercialize the technology, but a company executive, who spoke on the condition of anonymity, said the company could integrate it into Bing to help users diagnose ailments. The company could also develop tools to help medical experts improve or even automate patient care. "What you'll see over the next couple of years is us doing more and more work proving these systems out in the real world," Suleyman says. The project is the latest in a growing body of research showing how AI models can diagnose disease. In the last few years, both Microsoft and Google have published papers showing that large language models can accurately diagnose an ailment when given access to medical records. The new Microsoft research differs from previous work in that it more accurately replicates the way human physicians diagnose disease -- by analyzing symptoms, ordering tests, and performing further analysis until a diagnosis is reached. Microsoft describes the way that it combined several frontier AI models as "a path to medical superintelligence," in a blog post about the project today. The project also suggests that AI could help lower health care costs, a critical issue, particularly in the US. "Our model performs incredibly well, both getting to the diagnosis and getting to that diagnosis very cost effectively," says Dominic King, a vice president at Microsoft who is involved with the project.
[2]
Microsoft AI system diagnoses complex cases better than human doctors - and for less money
Research on AI for medicine looks increasingly promising -- the tech already speeds up drug development, Google is using AI to improve its medical advice, and wearable companies are leveraging the technology for predictive health features. Now, Microsoft is the latest to move the goal post. On Monday, the company announced in a blog post that Microsoft AI Diagnostic Orchestrator (MAI-DxO), its medical AI system, successfully diagnosed 85% of cases in the New England Journal of Medicine (NEJM). This rate of diagnosis is more than four times higher than that of human physicians. NEJM cases are particularly complex and often require several specialists. Given how inaccessible, complex, and confusing healthcare systems continue to be, it's no surprise people are seeking help from technology wherever possible. "Across Microsoft's AI consumer products like Bing and Copilot, we see over 50 million health-related sessions every day," Microsoft said in the announcement. "From a first-time knee-pain query to a late-night search for an urgent-care clinic, search engines and AI companions are quickly becoming the new front line in healthcare." Human physicians must pass the US Medical Licensing Examination (USMLE) to practice medicine, a test that's also used to evaluate how AI systems perform in medical contexts, both model-to-model and when compared with humans. Currently, AI scores well on the USMLE -- a side effect, Microsoft said, of the models memorizing (rather than understanding) answers to multiple-choice questions, which won't produce the most sound medical analysis. Most industry-standard AI benchmarks have been saturated for a while, meaning AI models are evolving too quickly for the tests to be usefully challenging. To combat this issue, Microsoft created the Sequential Diagnosis Benchmark (SD Bench). 
Sequential diagnosis is a process real clinicians use to diagnose patients by beginning with how their symptoms present and proceeding with questions and tests from there. The test presents diagnostic challenges from 304 NEJM cases, which humans and AI models can use to ask questions. Microsoft then paired the diagnostic agent, MAI-DxO, with several frontier models, including GPT, Llama, Claude, Gemini, Grok, and DeepSeek, and put the agent to the SD Bench test. MAI-DxO turns whatever LLM it is using into a "virtual panel of physicians with diverse diagnostic approaches collaborating to solve diagnostic cases," Microsoft explained. In a video demo, MAI-DxO also shows its reasoning as it queries the benchmark, develops possible diagnoses, and tracks the cost of each requested test. Once the agent has the required information from the benchmark about the case, it updates its diagnoses, asking for different scans and displaying a diagnostic process much more familiar to human physicians. "MAI-DxO boosted the diagnostic performance of every model we tested," said Microsoft's blog post, noting that the system performed best when paired with OpenAI's o3 model. The company compared the results to those of 21 physicians from the UK and the US with experience ranging from five to 20 years, who reached a mean accuracy of just 20%. Microsoft noted that MAI-DxO is also configurable, meaning it can run within cost limitations set by a user or organization -- a feature that lets the agent run a cost-benefit analysis of certain tests, which is highly relevant to the astronomical pricing of US medical care and something human doctors and patients have to consider as well. 
This feature is also a guardrail, of sorts -- without it, the AI might "default to ordering every possible test -- regardless of cost, patient discomfort, or delays in care," the blog post explained. MAI-DxO also returned higher accuracy and lower costs than individual models or human physicians. So will AI replace doctors? Probably not anytime soon -- though Microsoft's blog post noted that, because of its breadth of knowledge, AI can demonstrate "clinical reasoning capabilities that, across many aspects of clinical reasoning, exceed those of any individual physician." The company believes systems like this one can "reshape healthcare" by giving patients the option to check themselves reliably and help doctors with complex cases. The cost savings would be another plus for an industry constantly plagued by inexplicably high costs and opaque pricing structures. Microsoft conceded that MAI-DxO has only been tested on these special cases, so it's unclear how it would handle everyday tasks. However, this issue may not be relevant anyway if the agent isn't intended to replace human doctors, which Microsoft also maintained in the blog post. MAI-DxO is part of a "dedicated consumer health effort" Microsoft AI initiated last year, the company said in the release. Other AI products within that initiative include RAD-DINO, a radiology workflow tool, and Microsoft Dragon Copilot, a voice AI assistant designed for medical professionals.
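The configurable cost limit described in this article can be sketched as a simple pre-order check. This is a hypothetical illustration, not Microsoft's implementation; the test names, prices, and the `order_tests` function are invented:

```python
# Hypothetical sketch of a cost-ceiling guardrail: before each test is
# ordered, the orchestrator checks whether it fits in the remaining budget.
# Test names and prices are invented for illustration.

TEST_PRICES = {"cbc": 30, "chest_xray": 120, "ct_chest": 900, "biopsy": 2500}

def order_tests(requested, budget):
    """Approve requested tests in order, skipping any that would exceed the budget."""
    approved, spent = [], 0
    for test in requested:
        price = TEST_PRICES[test]
        if spent + price <= budget:
            approved.append(test)
            spent += price
    return approved, spent

# With a $1,000 cap, the cheap workup is approved and the costly scans are deferred:
tests, cost = order_tests(["cbc", "chest_xray", "ct_chest", "biopsy"], budget=1000)
# tests == ["cbc", "chest_xray"], cost == 150
```

Without the budget argument acting as a constraint, nothing stops such an agent from approving every test it can name, which is the failure mode the blog post warns about.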
[3]
Can AI outdiagnose doctors? Microsoft's tool is 4 times better for complex cases
[4]
Microsoft unveils AI diagnosis tool in effort to transform medicine
Microsoft has built an artificial intelligence-powered medical tool it claims is four times more successful than human doctors at diagnosing complex ailments, as the tech giant unveils research it believes could speed up treatment and save money by reducing unnecessary tests. The "Microsoft AI Diagnostic Orchestrator" is the first initiative to come out of an AI health unit formed last year by Mustafa Suleyman with staff poached from DeepMind, the research lab he co-founded and which is now owned by rival Google. In an interview with the Financial Times, the chief executive of Microsoft AI said the trial was a step on the path to "medical superintelligence" that could help solve staffing crises and long waiting times for overstretched health systems. A version of the technology could soon also be deployed in Microsoft's Copilot AI chatbot and Bing search engine, which handle 50mn health queries a day. We are nearing "AI models that are not just a little bit better, but dramatically better, than human performance: faster, cheaper and four times more accurate," said Suleyman. "That is going to be truly transformative." Suleyman's new effort comes after DeepMind led the way on AI-related healthcare breakthroughs. The Google lab's chief Sir Demis Hassabis jointly won a chemistry Nobel Prize last year for using AI to unlock the biological secrets of proteins that underpin life. Microsoft's new system is underpinned by a so-called "orchestrator" that creates virtual panels of five AI agents acting as "doctors" -- each with a distinct role, such as coming up with hypotheses or choosing diagnostic tests -- which interact and "debate" together to choose a course of action. To test its capabilities, "MAI-DxO" was fed 304 studies from the New England Journal of Medicine (NEJM) that describe how some of the most complicated cases were solved. 
This allowed researchers to test if the programme could figure out the correct diagnosis and monitor the decisions it made, using a new technique called "chain of debate", which makes AI reasoning models show how they solve problems step-by-step. Microsoft used leading large language models from OpenAI, Meta, Anthropic, Google, xAI and DeepSeek. The orchestrator made all LLMs perform better, but worked best with OpenAI's o3 reasoning model to correctly solve 85.5 per cent of the NEJM cases. That compared with about 20 per cent by experienced human doctors, but those physicians were not allowed access to textbooks or to ask colleagues in the trial, which could have increased their success rate. Microsoft has invested almost $14bn into OpenAI and has exclusive rights to use and sell its technology. However, the tech giant is embroiled in high-stakes brinkmanship with the start-up, which is attempting to convert into a for-profit entity, with both sides clashing over the future terms of their partnership. Suleyman said that while OpenAI's model performed the best, Microsoft was "agnostic" over which of the four "world-class models" MAI-DxO used. "We have long believed that they'll become commodities . . . it's the aggregate orchestrator which I think is the differentiator," he said. Dominic King, the former head of DeepMind's health unit who joined Microsoft late last year, said that the programme had "performed better than anything we've ever seen before" and that "there is an opportunity here today to act almost as a new front door to healthcare". The AI models were also prompted to be cost-conscious, which significantly cut the number of tests required to get to a correct diagnosis in the trial, saving hundreds of thousands of dollars in some cases, he said. However, King stressed that the technology was still in its early stages, had not been peer reviewed and was not yet ready for a clinical environment. 
"This is a landmark study," said Eric Topol, a cardiologist and founder and director of the Scripps Research Translational Institute. "While this work was not done in the setting of real world medical practice, it is the first to provide evidence for the efficiency potential of generative AI in medicine -- accuracy and cost savings."
[5]
AI system matches diagnostic accuracy while cutting medical costs
By Dr. Priyom Bose, Ph.D. Reviewed by Lauren Hardaker. Jul 2 2025
In a new study, Microsoft's AI-powered diagnostic system outperformed experienced doctors in solving the most challenging medical cases faster, cheaper, and more accurately. Study: Sequential Diagnosis with Language Models. *Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information. A recent study on the arXiv preprint server compared the diagnostic accuracy and resource expenditure of AI systems with those of clinicians on complex cases. The Microsoft AI team demonstrated the efficient use of artificial intelligence (AI) in medicine to tackle diagnostic challenges that physicians struggle to decipher.
Sequential diagnosis and language models
Often, physicians diagnose a patient's ailment through a clinical reasoning process that involves step-by-step, iterative questioning and testing. Even with limited initial information, clinicians narrow down the possible diagnosis by questioning the patient and confirming through biochemical tests, imaging, biopsy, and other diagnostic procedures. Solving a complex case requires a wide-ranging set of skills, including determining the most critical next questions or tests, staying aware of test costs to prevent increasing patient burden, and recognizing when the evidence supports a confident diagnosis. Multiple studies have demonstrated the strong performance of language models (LMs) on medical licensing exams and highly structured diagnostic vignettes. However, the performance of most LMs was evaluated under artificial conditions, which drastically differ from real-world clinical settings. Most LM diagnostic assessments are based on multiple-choice quizzes, where the diagnosis is made from a predefined answer set. 
A reduced sequential diagnosis cycle increases the risk of overstating model competence on static benchmarks. Furthermore, these diagnostic models present the risk of indiscriminate test ordering and premature diagnostic closure. Therefore, there is an urgent need for an AI system based on a sequential diagnosis cycle to improve diagnostic accuracy and reduce test costs.
About the study
To overcome the above-stated drawbacks of LM-based clinical diagnosis, scientists developed the Sequential Diagnosis Benchmark (SDBench) as an interactive framework to evaluate diagnostic agents (human or AI) through realistic sequential clinical encounters. To assess diagnostic accuracy, the current study utilized weekly cases published in The New England Journal of Medicine (NEJM), the world's leading medical journal. This journal typically publishes case records of patients from Massachusetts General Hospital in a detailed, narrative format. These cases are among the most diagnostically challenging and intellectually demanding in clinical medicine, often requiring multiple specialists and diagnostic tests to confirm a diagnosis. SDBench recast 304 cases from the 2017-2025 NEJM clinicopathological conference (CPC) into stepwise diagnostic encounters. The medical data spanned clinical presentations to final diagnoses, ranging from common conditions (e.g., pneumonia) to rare disorders (e.g., neonatal hypoglycemia). Using the interactive platform, diagnostic agents decide which questions to ask, which tests to order, and when to confirm a diagnosis. The Information Gatekeeper is a language model that selectively discloses clinical details from a comprehensive case file only when explicitly queried. It can also provide additional case-consistent information for tests not described in the original CPC narrative. After the agent makes a final diagnosis based on information obtained from the Gatekeeper, that diagnosis is scored against the real diagnosis. 
In addition, the cumulative cost of all requested diagnostic tests conducted in real-world diagnosis was estimated. By evaluating diagnostic accuracy and cost, SDBench indicates how close we are to high-quality care at a sustainable cost.
Study findings
The current study analyzed the performance of all diagnostic agents on SDBench. AI agents were evaluated on all 304 NEJM cases, while physicians were assessed on a held-out subset of 56 test-set cases. This study observed that AI agents performed better on this subset than physicians. Physicians practicing in the USA and UK with a median of 12 years of clinical experience achieved 20% diagnostic accuracy at an average cost of $2,963 per case on SDBench, highlighting the benchmark's inherent difficulty. Physicians spent an average of 11.8 minutes per case, asking 6.6 questions and ordering 7.2 tests. GPT-4o outperformed physicians in terms of both diagnostic accuracy and cost. Commercially available off-the-shelf models offered varied diagnostic accuracy and cost. The current study also introduced the MAI Diagnostic Orchestrator (MAI-DxO), a platform co-designed with physicians, which exhibited higher diagnostic efficiency than human physicians and commercial language models. Compared to commercial LMs, MAI-DxO demonstrated higher diagnostic accuracy while cutting medical costs by more than half. For instance, the off-the-shelf o3 model achieved diagnostic accuracy of 78.6% for $7,850, while MAI-DxO achieved 79.9% accuracy at just $2,397, or 85.5% at $7,184. MAI-DxO accomplished this by simulating a virtual panel of "doctor agents" with different roles in hypothesis generation, test selection, cost-consciousness, and error checking. Unlike baseline AI prompting, this structured orchestration allowed the system to reason iteratively and efficiently. MAI-DxO is a model-agnostic approach that has demonstrated accuracy gains across various language models, not just o3. 
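The Gatekeeper setup described in the study can be sketched as a small class that hides the full case file, reveals findings only on explicit request, and tallies the cost of what was ordered. The field names, prices, and the example case are invented for illustration; the real Gatekeeper is itself a language model:

```python
class Gatekeeper:
    """Holds the full case file; discloses findings only when explicitly queried."""

    def __init__(self, case_file, prices):
        self._case = case_file
        self._prices = prices
        self.total_cost = 0

    def request(self, finding):
        # Each request is billed at that test's price; plain questions cost nothing.
        self.total_cost += self._prices.get(finding, 0)
        return self._case.get(finding, "not available")

# Invented example: the agent questions the patient first, then orders one test.
case = {"history": "three weeks of night sweats", "chest_ct": "cavitary lesion"}
gk = Gatekeeper(case, prices={"chest_ct": 800})

history = gk.request("history")   # free questioning of the patient
scan = gk.request("chest_ct")     # billed diagnostic test
# gk.total_cost == 800
```

Because the agent only sees what it asks for, the benchmark can score both what it concluded and what it spent getting there, which is how SDBench produces the paired accuracy/cost figures quoted above.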
Conclusions and future outlooks
The current study's findings demonstrate AI systems' higher diagnostic accuracy and cost-effectiveness when guided to think iteratively and act judiciously. SDBench and MAI-DxO provide an empirically grounded foundation for advancing AI-assisted diagnosis under realistic constraints. In the future, MAI-DxO must be validated in clinical environments, where diseases present with everyday prevalence and variety rather than as rare, curated cases. Furthermore, large-scale interactive medical benchmarks involving more than 304 cases are required. Incorporation of visual and other sensory modalities, such as imaging, could also enhance diagnostic accuracy without compromising cost efficiency. However, the authors note important limitations. NEJM CPC cases are selected for their difficulty and do not reflect everyday clinical presentations. The study did not include healthy patients or measure false positive rates. Moreover, diagnostic cost estimates are based on U.S. pricing and may vary globally. The models were also tested on a held-out test set of recent cases (2024-2025) to assess generalization and avoid overfitting, as many of these cases were published after the training cutoff for most models. The paper also raises a broader question: Should we compare AI systems to individual physicians or full medical teams? Since MAI-DxO mimics multi-specialist collaboration, the comparison may reflect something closer to team-based care than individual practice. Nonetheless, the research suggests that structured AI systems like MAI-DxO may one day support or augment clinicians, particularly in settings where specialist access is limited or expensive. 
Journal reference: Preliminary scientific report. Nori, H. et al. (2025) Sequential Diagnosis with Language Models. arXiv. https://arxiv.org/abs/2506.22405
[6]
AI vs. MDs: Microsoft AI tool outperforms doctors in diagnosing complex medical cases
Microsoft today announced an artificial intelligence tool that outperformed a panel of medical doctors in diagnosing complicated cases. The Microsoft AI Diagnostic Orchestrator (MAI-DxO) faced off against 21 experienced physicians from the U.S. and United Kingdom presented with complex cases documented in the New England Journal of Medicine. MAI-DxO gave a correct diagnosis for 85.5% of the test cases while the doctors hit the mark 20% of the time. Microsoft used the tool with well-known AI models including GPT, Llama, Claude, Gemini, Grok and DeepSeek. The best setup was MAI-DxO paired with OpenAI's o3. Like a human physician, MAI-DxO diagnoses by analyzing symptoms, posing questions, and recommending medical tests. A key feature is its ability to optimize costs, preventing the ordering of superfluous diagnostics that contribute to overspending in healthcare. While MAI-DxO outperformed the medical providers, the tech company acknowledged that under normal conditions the doctors would not be operating in isolation and would be able to consult colleagues and online and print resources. The new diagnostic performance benchmark was created from 304 recent cases documented by the New England Journal of Medicine. The tool builds on earlier tests of AI performance in medicine that quizzed the bots on the U.S. Medical Licensing Examination (USMLE) standardized test. Microsoft said AI tools have evolved to get nearly perfect scores on the test, but their multiple-choice structure favors "memorization over deep understanding." The new benchmark using complicated cases requires higher-order skills to perform "sequential diagnosis, a cornerstone of real-world medical decision making," according to Microsoft's blog post. Next steps for developing the tool for public use include testing its abilities against more commonplace ailments. 
Before it could be deployed in healthcare practices, it would require testing in a clinical setting for safety and performance and approval from regulators. As it has with other AI applications, Microsoft emphasized that the bots are not meant to replace people but to optimize their output, in this case automating routine tasks, assisting in diagnosis and creating personalized care strategies.
[7]
AI's Diagnostic Skills Are Now Comparable To Non-Expert Physicians
AsianScientist (Jul. 02, 2025) - Artificial Intelligence (AI) is revolutionizing various sectors, including medicine, particularly disease diagnosis. Studies have explored how well AI models can interpret clinical data, analyze patient histories, and suggest diagnoses. Research is beginning to map out where these models excel and where they fall short. However, there has been a lack of a comprehensive meta-analysis comparing the diagnostic performance of generative AI models with that of physicians. Such a comparison is essential to understand how well these models perform in real-world clinical settings. Although individual studies have offered important insights, a systematic review was essential to bring together the findings and evaluate how these models measure up against traditional diagnoses made by physicians. Researchers at Osaka Metropolitan University's Graduate School of Medicine conducted a meta-analysis of generative AI's diagnostic capabilities using 83 research papers published between June 2018 and June 2024, covering a wide range of medical specialties. Among the large language models (LLMs) analyzed, ChatGPT was the most studied. The comparative evaluation revealed that medical specialists had a 15.8 percent higher diagnostic accuracy than generative AI. The average diagnostic accuracy of generative AI was 52.1 percent, with the latest models occasionally demonstrating accuracy comparable to that of non-specialist doctors. The findings were published in npj Digital Medicine. "This research shows that generative AI's diagnostic capabilities are comparable to non-specialist doctors. It could be used in medical education to support non-specialist doctors and assist in diagnostics in areas with limited medical resources," said Hirotaka Takita, lecturer at the Department of Diagnostic and Interventional Radiology, Osaka Metropolitan University, and an author of the study. 
"Further research, such as evaluations in more complex clinical scenarios, performance evaluations using actual medical records, improving the transparency of AI decision-making, and verification in diverse patient groups, is needed to verify AI's capabilities," he added. Research that contrasts the performance of generative AI with that of physicians provides valuable information for medical training. Expert doctors are still much more accurate than AI at present, highlighting the importance of human judgment. However, since AI performs at a level comparable to non-expert doctors, it could be a valuable tool in training medical students and residents. The paper noted that AI could help simulate real-life cases, provide feedback, and support learning through practice. The article is titled "A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians". Disclaimer: This article does not necessarily reflect the views of AsianScientist or its staff.
[8]
Microsoft says AI system better than doctors at diagnosing complex health conditions
Firm says results of research create 'path to medical superintelligence' but plays down job implications Microsoft has revealed details of an artificial intelligence system that performs better than human doctors at complex health diagnoses, creating a "path to medical superintelligence". The company's AI unit, which is led by the British tech pioneer Mustafa Suleyman, has developed a system that imitates a panel of expert physicians tackling "diagnostically complex and intellectually demanding" cases. Microsoft said that when paired with OpenAI's advanced o3 AI model, its approach "solved" more than eight of 10 case studies specially chosen for its research. When those case studies were tried on practising physicians - who had no access to colleagues, textbooks or chatbots - the accuracy rate was two out of 10. Microsoft said it was also a cheaper option than using human doctors because it was more efficient at ordering tests. Despite highlighting the potential cost savings from its research, Microsoft played down the job implications, saying it believed AI would complement doctors' roles rather than replace them. "Their clinical roles are much broader than simply making a diagnosis. They need to navigate ambiguity and build trust with patients and their families in a way that AI isn't set up to do," the company wrote in a blogpost announcing the research, which is being submitted for peer review. However, using the slogan "path to medical superintelligence" raises the prospect of radical change in the healthcare market. While artificial general intelligence (AGI) refers to systems that match human cognitive abilities at any given task, superintelligence is an equally theoretical term referring to a system that exceeds human intellectual performance across the board. 
Explaining the rationale behind the research, Microsoft played down the significance of AI's ability to score exceptionally well in the United States Medical Licensing Examination, a key test for obtaining a medical licence in the US. It said the multiple-choice tests favoured memorising answers over deep understanding of a subject, which could help "overstate" the competence of an AI model. Microsoft said it was developing a system that, like a real-world clinician, takes step-by-step measures - such as asking specific questions and requesting diagnostic tests - to arrive at a final diagnosis. For instance, a patient with symptoms of a cough and fever may require blood tests and a chest X-ray before the doctor arrives at a diagnosis of pneumonia. The new Microsoft approach uses complex case studies from the New England Journal of Medicine (NEJM). Suleyman's team transformed more than 300 of these studies into "interactive case challenges" that it used to test its approach. Microsoft's approach used existing AI models, including those produced by ChatGPT's developer, OpenAI, Mark Zuckerberg's Meta, Anthropic, Elon Musk's Grok and Google's Gemini. Microsoft then used a bespoke, agent-like AI system called a "diagnostic orchestrator" to work with a given model on what tests to order and what the diagnosis might be. The orchestrator in effect imitates a panel of physicians, which then comes up with the diagnosis. Microsoft said that when paired with OpenAI's advanced o3 model, it "solved" more than eight of 10 NEJM case studies - compared with a two out of 10 success rate for human doctors. Microsoft said its approach was able to wield a "breadth and depth of expertise" that went beyond individual physicians because it could span multiple medical disciplines. It added: "Scaling this level of reasoning - and beyond - has the potential to reshape healthcare. 
AI could empower patients to self-manage routine aspects of care and equip clinicians with advanced decision support for complex cases." Microsoft acknowledged its work is not ready for clinical use. Further testing is needed on its "orchestrator" to assess its performance on more common symptoms, for instance.
[9]
AI doctor four times better at identifying illnesses than humans
Microsoft has developed an artificial intelligence (AI) system that it claims is four times better than doctors at diagnosing complex illnesses. The tech company's AI diagnosis system was able to correctly identify ailments up to 86pc of the time, compared to just 20pc on average for British and American physicians. Announcing the findings, Microsoft claimed it had laid the groundwork for "medical superintelligence". It comes as Wes Streeting, the Health Secretary, is seeking to bring AI into widespread use in the NHS to improve efficiency. In April the NHS waiting list rose for the first time in seven months, reaching 7.42m in a blow to one of the Government's key pledges to cut waiting times. Microsoft claimed its system could solve problems more cheaply than doctors - beating physicians even when sticking to a budget for diagnostic tests. The system, known as Microsoft AI Diagnostic Orchestrator, or MAI-DxO, was tested on 304 cases from the New England Journal of Medicine, a medical journal known for publishing complex medical cases from Massachusetts General Hospital. The system comprised a virtual panel of five different AI bots, each serving different roles such as "Dr Hypothesiser", "Dr Test-Chooser" and "Dr Challenger" that would internally deliberate before asking further questions or ordering tests and providing a diagnosis. In one case, the system diagnosed embryonal rhabdomyosarcoma, a rare form of cancer that normally occurs in children, in a 29-year-old woman.
[10]
Microsoft's new medical AI cracks the hardest diagnoses
Microsoft says its diagnostic health care tool is not only making house calls, it's doing a better job than physicians. The tech giant's AI Diagnostic Orchestrator (MAI-DxO) tool "correctly diagnoses up to 85% of [New England Journal of Medicine] case proceedings, a rate more than four times higher than a group of experienced physicians," Microsoft wrote in a research post Monday. Microsoft's new Diagnostic Orchestrator experimental AI system is designed to mimic a virtual panel of physicians, integrating multiple large language models (LLMs) such as GPT, Claude, Gemini, Grok, and others, bundling pertinent data into distinct patient diagnoses. Microsoft said its AI doctor agent can provide hypotheses, suggest tests, challenge assumptions, enforce cost efficiency, and conduct quality control, with patient care as a priority. "The Microsoft AI team shares research that demonstrates how AI can sequentially investigate and solve medicine's most complex diagnostic challenges. (These are) cases that expert physicians struggle to answer," the company reported. MAI-DxO approaches medical patient diagnosis differently than human doctors, taking a sequential step-by-step route that starts with limited patient data, then proceeds to targeted queries and patient testing before providing a diagnosis. "We're taking a big step towards medical superintelligence," Microsoft AI chief executive officer Mustafa Suleyman noted on LinkedIn. "AI models have aced multiple-choice medical exams - but real patients don't come with ABC answer options." The sheer breadth and scope of the AI-powered MAI-DxO diagnostic tool may send shockwaves through the patient care community. "Increasingly, people are turning to digital tools for medical advice and support," the report stated. "Across Microsoft's AI consumer products like Bing and Copilot, we see over 50 million health-related sessions every day. 
From a first-time knee-pain query to a late-night search for an urgent-care clinic, search engines and AI companions are quickly becoming the new front line in healthcare." Microsoft also noted its aggressive move into "broader health initiatives" that signal a renewed commitment to quality patient care through "an expanding AI-based product line." "Existing solutions include RAD-DINO, which helps accelerate and improve radiology workflows and Microsoft Dragon Copilot, our pioneering voice-first AI assistant for clinicians," the company said. "For AI to make a difference, clinicians and patients alike must be able to trust its performance. That's where our new benchmarks and AI orchestrator come in." U.S. physicians are already using AI in growing numbers, with 72% of employed physicians and 64% of doctors in private practice deploying the technology, according to the American Medical Association. So is it time for physicians to contemplate retirement? Not yet. But in February, Microsoft founder Bill Gates told Jimmy Fallon on The Tonight Show that he believed AI could replace doctors "over the next decade."
[11]
Microsoft's AI Is Better Than Doctors at Diagnosing Disease
The company reports in a study published on the preprint site arXiv that its AI-based medical program, the Microsoft AI Diagnostic Orchestrator (MAI-DxO), correctly diagnosed 85% of cases described in the New England Journal of Medicine. That's four times higher than the accuracy rate of human doctors, who came up with the right diagnoses about 20% of the time. The cases are part of the journal's weekly series designed to stump doctors: complicated, challenging scenarios where the diagnosis isn't obvious. Microsoft took about 300 of these cases and compared the performance of its MAI-DxO to that of 21 general-practice doctors in the U.S. and U.K. In order to mimic the iterative way doctors typically approach such cases -- by collecting information, analyzing it, ordering tests, and then making decisions based on those results -- Microsoft's team first created a stepwise decision-making benchmark process for each case study. This allowed both the doctors and the AI system to ask questions and make decisions about next steps, such as ordering tests, based on the information they learned at each step -- similar to a flow chart for decision-making, with subsequent questions and actions based on information gleaned from previous ones. The 21 doctors were compared to a pooled set of off-the-shelf AI models that included Claude, DeepSeek, Gemini, GPT, Grok, and Llama. To further mirror the way human doctors approach such challenging cases, the Microsoft team also built an Orchestrator: a virtual emulation of the sounding board of colleagues and consultations that physicians often seek out in complex cases.
[12]
Microsoft claims its AI tool can diagnose complex medical cases four times more accurately than doctors
One of the world's corporate powerhouses announced what could be a big win for the AI economy. In a blog post, Microsoft said its AI diagnostic tool -- the Microsoft AI Diagnostic Orchestrator (MAI-DxO), which simulates a panel of physicians and is trained using the standard Medical Licensing Examination -- diagnosed cases four times as accurately as physicians after both parties were able to ask questions, order tests, and, eventually, finalize a diagnosis. In the post -- written by Harsha Nori, head of AI at Microsoft AI Health, and Dominic King, VP of Health at Microsoft AI -- the company claimed its AI diagnosed 85% of over 300 real-world cases correctly, and that the model's process "gets to the correct diagnosis more cost-effectively than physicians." Microsoft claims MAI-DxO "can blend both breadth and depth of expertise, demonstrating clinical reasoning capabilities that, across many aspects of clinical reasoning, exceed those of any individual physician." AI is already rapidly evolving across the health care ecosystem. According to Microsoft, over 50 million health-related sessions occur daily using Microsoft's AI consumer products. "From a first-time knee-pain query to a late-night search for an urgent-care clinic, search engines and AI companions are quickly becoming the new front line in healthcare," the blog post said. Beyond being a sounding board for health questions, AI is also expanding into physical clinics. With staffing shortages, long wait times, and a total of $5 trillion in annual health care expenditures, the industry is ripe for technological advancements. Beyond diagnostics, a separate study found couples in distress can derive similar mental-health benefits from AI therapy as they can from human therapists. However, there is still hesitancy about how the AI will be implemented, the accumulation of sensitive data, and, of course, the future of the doctor. Nearly half of U.S. 
patients (48%) and 63% of clinicians are optimistic that AI can improve health outcomes, according to research from the 2025 Philips Future Health Index (FHI). It's undeniable that minding this gap and building optimism among consumers, particularly those who may not trust traditional health care, is key to building and scaling new technological solutions. "Breakthroughs need trust for real-world impact," Dominic King, who co-wrote the blog post, told Fortune in a statement. "That's why we're committed to earning the trust of health care professionals and patients through rigorous safety testing, clinical validation, and regulatory reviews." Microsoft said it views the technology as a "complement to doctors and other health professionals" and emphasized doctors' ability "to navigate ambiguity and build trust with patients and their families" is not something AI can replicate. "Doctors aren't going anywhere. AI will help them arrive at diagnoses and effective care plans faster, but it can't replace the human connection and understanding patients' needs," King said. The team at Microsoft noted the limitations of this research. For one, the physicians in the study had between five and 20 years of experience, but were unable to use textbooks, coworkers, or -- ironically -- generative AI for their answers. It could have limited their performance, as these resources may typically be available during a complex medical situation. Moreover, the doctors and AI analyzed only complex cases and not everyday ones. "Important challenges remain before generative AI can be safely and responsibly deployed across healthcare," the team wrote. "We need evidence drawn from real clinical environments, alongside appropriate governance and regulatory frameworks to ensure reliability, safety, and efficacy."
[13]
Paging Dr. Algorithm: Microsoft's AI Diagnoses Like House, Bills Like Costco - Decrypt
The Microsoft CEO announced two healthcare AI advances on social media this week, including MAI-DxO, a system that simulates multiple virtual doctors working together to solve medical mysteries. In testing against 304 complex cases from the New England Journal of Medicine, Microsoft reported that the AI correctly diagnosed 85.5% of them. A group of 21 experienced physicians tackling the same cases? They got 20% right. "Excited to share two advances that bring us closer to real-world impact in healthcare AI," Nadella wrote. "MAI-DxO is a model-agnostic orchestrator that simulates a panel of virtual physicians. It achieves 85.5% diagnostic accuracy -- four times that of experienced doctors -- while cutting diagnostic costs." The announcement comes as Microsoft joins a crowded field of tech companies racing to apply AI to healthcare's thorniest problems. With Americans spending nearly $5 trillion annually on healthcare -- and diagnostic errors affecting 12 million people each year, according to Johns Hopkins University -- the idea of using AI to reduce human error seems like a no-brainer. MAI-DxO works like a medical dream team trapped in a computer. The system tackles cases through what Microsoft calls the Sequential Diagnosis Benchmark, or SDBench. Instead of multiple-choice questions like traditional medical AI tests, it mirrors how doctors actually work: starting with limited information about a patient, asking follow-up questions, ordering tests, and adjusting theories as new data arrives. Each test incurs a cost in virtual money, forcing the AI to balance thoroughness against healthcare spending. In other words, it basically simulates a medical council debating a case, with different models playing different roles. The models debate, disagree, and eventually reach a consensus, much as a panel of physicians would if yours were a challenging case. 
In one configuration, MAI-DxO achieved 80% accuracy while spending $2,397 per case, approximately 20% less than the $2,963 that physicians typically spend. At peak performance, it achieved 85.5% accuracy at a cost of $7,184 per case. By comparison, OpenAI's standalone o3 model achieved 78.6% accuracy but cost $7,850. The virtual physician panel includes Dr. Hypothesis, who maintains a running list of the three most likely diagnoses using Bayesian probability methods. Dr. Test-Chooser selects up to three diagnostic tests per round, aiming for maximum information gain. Dr. Challenger acts as the contrarian, seeking evidence that contradicts the prevailing theory. Dr. Stewardship vetoes expensive tests with low diagnostic value. Meanwhile, Dr. Checklist ensures all test names are valid and the team's reasoning stays consistent. Microsoft tested the system on cases published in the New England Journal of Medicine between 2024 and 2025, after the AI's training cutoff date, eliminating any possibility the model had memorized the answers. The studies were difficult cases that required thorough examination to be properly diagnosed. The 21 physicians Microsoft recruited for comparison had between 5 and 20 years of experience, with a median of 12 years. They worked without access to colleagues, textbooks, or AI assistance to ensure a fair comparison of raw diagnostic ability. They reported a 20% success rate on these admittedly difficult cases. The system operates in several modes. "Instant Answer" provides a diagnosis based solely on initial information for $300 -- the cost of one physician visit. "Question Only" allows follow-up questions without ordering tests. "Budgeted" tracks costs with a maximum spending limit. "No Budget" gives the panel free rein, while "Ensemble" runs multiple panels and aggregates their conclusions for maximum accuracy. MAI-DxO represents Microsoft's broader push into consumer health AI. 
The company reports over 50 million health-related sessions daily across its Bing and Copilot products. From knee pain searches to urgent care lookups, Microsoft sees search engines and AI assistants becoming the new front door for healthcare. Of course, this is just one more step in a very long timeline of medical tech. For context, Stanford's MYCIN system diagnosed bacterial infections in the 1970s, and Google's AMIE simulated doctor-patient conversations just last year. Microsoft developed MAI-DxO as a model-agnostic system, meaning it can work with AI models from different companies. In testing, it boosted performance across models from OpenAI, Google, Anthropic, Meta, and others by an average of 11%. The improvement was statistically significant across all tested models. Dr. Dominic King and Harsha Nori, who led the research at Microsoft AI, emphasized in a blog post that the technology remains a research demonstration. "Important challenges remain before generative AI can be safely and responsibly deployed across healthcare," they wrote. The system excels at complex diagnostic challenges but needs testing on routine cases. Microsoft plans to submit the research for peer review and is working with healthcare organizations to validate the approach in clinical settings. The company has made clear that any deployment would require "rigorous safety testing, clinical validation, and regulatory reviews." For now, MAI-DxO remains confined to research labs. But with diagnostic errors contributing to nearly 10% of patient deaths and affecting millions annually, Microsoft's virtual physician panel represents another step toward AI-assisted healthcare. The five-doctor AI team might diagnose better than 21 human physicians combined, but it is still too early to see a mainstream implementation. Microsoft says AI won't replace doctors; it will augment them. The 21 physicians who scored 20% on those brutal NEJM cases are probably hoping that's true.
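As a rough illustration only, the role-based deliberation Decrypt describes (Dr. Hypothesis, Dr. Test-Chooser, Dr. Challenger, Dr. Stewardship, Dr. Checklist) could be sketched in Python as below. Every name, cost, and rule here is invented for illustration; Microsoft has not published MAI-DxO's implementation, and the real system runs large language models in each role rather than simple heuristics.

```python
from dataclasses import dataclass

# Toy stand-ins for the five virtual-doctor roles. All values are invented.
TEST_COSTS = {"cbc": 50.0, "chest_xray": 200.0, "mri": 1500.0}

@dataclass
class CaseState:
    findings: list          # information gathered so far
    hypotheses: list        # ranked candidate diagnoses
    spent: float = 0.0      # virtual dollars spent on tests
    budget: float = 3000.0  # spending cap ("Budgeted" mode)

def dr_hypothesis(state):
    """Keep a short ranked differential (stand-in for Bayesian updating)."""
    return state.hypotheses[:3]

def dr_test_chooser(state):
    """Propose up to three tests not yet performed."""
    done = set(state.findings)
    return [t for t in TEST_COSTS if t not in done][:3]

def dr_challenger(state):
    """Contrarian check: flag a differential that collapsed too early."""
    return len(state.hypotheses) < 2

def dr_stewardship(state, tests):
    """Veto tests that would exceed the remaining budget."""
    kept, remaining = [], state.budget - state.spent
    for t in tests:
        if TEST_COSTS[t] <= remaining:
            kept.append(t)
            remaining -= TEST_COSTS[t]
    return kept

def dr_checklist(tests):
    """Reject unknown test names before they are 'ordered'."""
    return [t for t in tests if t in TEST_COSTS]

def panel_round(state):
    """One chain-of-debate round: rank, vet, order tests, update spend."""
    state.hypotheses = dr_hypothesis(state)
    if dr_challenger(state):
        state.hypotheses.append("unknown_rare_disease")  # reopen the search
    tests = dr_stewardship(state, dr_checklist(dr_test_chooser(state)))
    for t in tests:
        state.spent += TEST_COSTS[t]
        state.findings.append(t)
    return tests

state = CaseState(findings=["fever", "cough"],
                  hypotheses=["pneumonia", "tb", "lymphoma", "sarcoidosis"])
ordered = panel_round(state)  # tests the panel agreed to order this round
```

The point of the sketch is the division of labor: test selection and cost control are separate adversarial roles, so an expensive low-value test proposed by one agent can be vetoed by another before it is "ordered."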
[14]
Microsoft says AI outperforms doctors at complex medical diagnoses
The tech giant's AI tool made the correct diagnosis for the vast majority of patients in a small study. Microsoft said it is one step closer to "medical superintelligence" after a new artificial intelligence (AI) tool beat doctors at diagnosing complex medical problems. Tech giants are racing to develop superintelligence, which refers to an AI system that exceeds human intellectual abilities in every way - and they're promising to use it to upend healthcare systems around the world. For the latest experiment, Microsoft tested an AI diagnostic system against 21 experienced physicians, using real-world case studies from 304 patients that were published in the New England Journal of Medicine, a leading medical journal. The AI tool correctly diagnosed up to 85.5 per cent of cases - roughly four times more than the group of doctors from the United Kingdom and the United States, who had between five and 20 years of experience. The model was also cheaper than human doctors, ordering fewer scans and tests to reach the correct diagnosis, the analysis found. Microsoft said the findings indicate that AI models can reason through complex diagnostic problems that stump physicians, who specialise in their fields but are not experts in every aspect of medicine. However, AI "can blend both breadth and depth of expertise, demonstrating clinical reasoning capabilities that, across many aspects of clinical reasoning, exceed those of any individual physician," Microsoft executives said in a press release. "This kind of reasoning has the potential to reshape healthcare". Microsoft does not see AI replacing doctors anytime soon, saying the tools will instead help physicians automate some routine tasks, personalise patients' treatment, and speed up diagnoses. Microsoft's AI system made diagnoses by mimicking a doctor's process of collecting a patient's details, ordering tests, and eventually narrowing down a medical diagnosis. 
A "gatekeeper agent" had information from the patient case studies. It interacted with a "diagnostic orchestrator" that asked questions and ordered tests, receiving results from the real-world workups. The company tested the system with leading AI models, including GPT, Llama, Claude, Gemini, Grok, and DeepSeek. OpenAI's o3 model, which is integrated into ChatGPT, correctly solved 85.5 per cent of the patient cases, compared to an average of 20 per cent among the group of 21 experienced doctors. The researchers published their findings online as a preprint article, meaning it has not yet been peer-reviewed. Microsoft also acknowledged some key limitations, notably that the AI tool has only been tested for complicated health problems, not more common, everyday issues. The panel of doctors also worked without access to their colleagues, textbooks, or other tools that they might typically use when making diagnoses. "This was done to enable a fair comparison to raw human performance," Microsoft said. The company called for more real-world evidence on AI's potential in health clinics, and said it will "rigorously test and validate these approaches" before making them more widely available.
[15]
Microsoft says AI can diagnose tough medical cases better than physicians
According to the tech company, the new system acts as a "virtual panel of diverse physicians" that bridges the gap between needing multiple real-life general physicians and specialists in the search for a diagnosis. They call it "medical superintelligence." In a report published on Monday, Microsoft claims the generative AI tool is around four times more accurate than a human physician when it comes to diagnosing complex issues -- and that it does so at a lower cost. "Increasingly, people are turning to digital tools for medical advice and support. Across Microsoft's AI consumer products like Bing and Copilot, we see over 50 million health-related sessions every day," the report states. "From a first-time knee-pain query to a late-night search for an urgent-care clinic, search engines and AI companions are quickly becoming the new front line in healthcare."
[16]
Microsoft's New AI Tool Can Diagnose Patients More Accurately Than Doctors
Microsoft's AI provided 4 times more accurate diagnoses than doctors
Microsoft researchers unveiled a new artificial intelligence (AI) system on Monday that can diagnose patients more accurately than human doctors. Dubbed the Microsoft AI Diagnostic Orchestrator (MAI-DxO), it includes multiple AI models and a framework that allows it to go through patient symptoms and history to suggest relevant tests. Based on the results, it then suggests possible diagnoses. The Redmond-based tech giant highlighted that apart from the accuracy of the diagnosis, the system is also trained to be cost-effective in terms of tests conducted. In a post on X (formerly known as Twitter), Mustafa Suleyman, the CEO of Microsoft AI, posted about the MAI-DxO system. Calling it a "big step towards medical superintelligence," he said the AI system can solve some of the world's toughest medical cases with higher accuracy and lower costs compared to traditional diagnostic measures. MAI-DxO simulates a virtual panel of physicians with diverse diagnostic approaches who collaborate to solve medical cases, the company said in a blog post. The Orchestrator includes a multi-agentic system where one agent provides a hypothesis, one picks the tests, two others provide checklists and stewardship, and the last challenges the hypothesis. Once a hypothesis passes this panel, the AI system can either ask a question, request tests, or provide the diagnosis if it determines it has enough information. If it recommends a test, it performs a cost analysis to ensure that the overall cost remains reasonable. Interestingly, the system is model agnostic, meaning it can work with any third-party AI model. Microsoft claims that the system boosts the diagnostic performance of every AI model that was tested. However, OpenAI's o3 fared the best by correctly solving 85.5 percent of the New England Journal of Medicine (NEJM) benchmark cases. 
The company said that the same cases were also given to 21 practising physicians from the US and UK, all of whom had between five and 20 years of clinical experience. The human doctors had an accuracy of 20 percent. MAI-DxO can be configured to operate within defined cost constraints, the company said. Once an input budget has been added, the system explores cost-to-value trade-offs while making diagnostic decisions. This helps the AI system order only the necessary tests, instead of every possible test to rule out all causes of the symptoms. To assess the AI system, Microsoft also developed a new benchmark dubbed the Sequential Diagnosis Benchmark (SDBench). Unlike typical medical benchmark tests that ask multiple-choice questions, this test assesses AI systems' ability to iteratively ask the right questions and order the right tests. Then it evaluates the answers by comparing them to the outcome published in the NEJM. Notably, the MAI-DxO is not yet approved for clinical use, and is meant as initial research into developing AI capability in diagnostic operations. Microsoft said that its AI system can only be approved for clinical usage after rigorous safety testing, clinical validation, and regulatory reviews.
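A minimal sketch of the kind of budgeted, sequential interaction described above might look like the following: an information-holding "gatekeeper" that reveals the case piece by piece, quizzed by a diagnostic loop that stops when it commits to a diagnosis or exhausts its budget. The class names, the toy policy, the $300 budget, and the case details are all invented for illustration; in the real benchmark, language models play both sides and diagnoses are scored against the published NEJM outcome.

```python
class Gatekeeper:
    """Holds the full case file and reveals information only when asked."""
    def __init__(self, case):
        self.case = case  # dict: presentation, test_results, true_diagnosis

    def initial_presentation(self):
        return self.case["presentation"]

    def run_test(self, name):
        # Unknown test names get a non-answer rather than an error.
        return self.case["test_results"].get(name, "test unavailable")

def run_episode(gatekeeper, policy, budget):
    """Iterate question/test rounds until the policy commits or money runs out."""
    evidence = [gatekeeper.initial_presentation()]
    spent = 0.0
    while True:
        action, payload, cost = policy(evidence, budget - spent)
        if action == "diagnose":
            return payload, spent
        if spent + cost > budget:
            return None, spent  # budget exhausted before a diagnosis
        evidence.append(gatekeeper.run_test(payload))
        spent += cost

def toy_policy(evidence, remaining):
    """Stand-in for the orchestrator: order one cheap test, then commit."""
    if len(evidence) < 2 and remaining >= 50:
        return "order_test", "cbc", 50.0
    return "diagnose", "pneumonia", 0.0

case = {"presentation": "29-year-old, fever and cough",
        "test_results": {"cbc": "elevated WBC"},
        "true_diagnosis": "pneumonia"}
diagnosis, cost = run_episode(Gatekeeper(case), toy_policy, budget=300.0)
```

Because every test carries a virtual price and the loop enforces the cap, accuracy and cost can be scored per episode, which is how the articles' accuracy-versus-spend comparisons (e.g. 80% at $2,397 versus 85.5% at $7,184) become possible.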
[17]
Microsoft Claims Its AI Is Better Than Doctors at Diagnosing Patients, But 'You Definitely Still Need Your Physician'
Microsoft says its new AI tool performs four times better than experienced human doctors at diagnosing complex health conditions -- but the company's AI CEO says human physicians are still needed to help treat the illnesses. In a paper released on Monday titled "The Path to Medical Superintelligence," Microsoft introduced an AI tool that correctly diagnosed complex cases up to 85% of the time and arrived at the diagnoses more cost-effectively than human physicians. Related: 'No Longer Optional': Microsoft Staff Mandated to Use AI at Work, According to a New Report The team used more than 300 complex case studies from the New England Journal of Medicine (NEJM) and had the AI tool imitate a panel of physicians to find the right diagnosis. The AI tool asked questions of the data, recommended what tests the hypothetical patient should get, and arrived at a diagnosis as it learned more information. As researchers added more data to the AI tool, it updated its best estimate of the diagnosis in real-time and explained the reasoning behind its conclusion. Microsoft researchers said that the AI tool correctly "solved" more than eight out of 10 NEJM case studies -- much better than the average success rate of two out of 10 for human physicians. The AI also ordered fewer hypothetical tests to arrive at the right diagnosis, making it more cost-effective than a human doctor. Human doctors are usually characterized by breadth, like a general family physician, or depth, like a specialist. The team noted that the AI tool demonstrated "clinical reasoning abilities" that "exceed those of any individual physician" due to its ability to combine both breadth and depth of expertise. The AI tool isn't ready for clinical use yet and will only be approved after safety testing and clinical validation. 
However, Microsoft AI CEO Mustafa Suleyman says it could help Microsoft provide "high-quality" health advice in response to the 50 million health-related queries it receives every day through its Copilot AI assistant and Bing search engine. "Although this is just early research, we're hoping that as we get this into production, it will give everybody access to very high-quality health information," Suleyman told Yahoo! Finance on Monday. The AI tool examines existing medical information, synthesizes it, and gives it back to humans at the right time -- but it still needs human doctors to hold it accountable, Suleyman said. Doctors are also required to plan and oversee treatment after a diagnosis. "You definitely still need your physician," Suleyman told the outlet, adding that the AI tool will likely get rolled out "in partnership with physicians themselves." Related: Microsoft AI CEO Says Almost All Content on the Internet Is Fair Game for AI Training Microsoft is currently testing the AI tool in real clinical environments to see how it performs on the job before any broader rollout. Microsoft is one of the most valuable companies in the world at the time of writing, second only to Nvidia, with a market cap of over $3.6 trillion.
[18]
Microsoft's new AI tool a medical genius? Tech giant claims it is 4x more accurate than real doctors
Tech giant Microsoft, recently hit with a fresh round of layoffs, has developed a new medical AI tool that performs better than human doctors at complex health diagnoses, creating a "path to medical superintelligence". The Microsoft AI team shared research that demonstrated how AI can sequentially investigate and solve medicine's most complex diagnostic challenges -- cases that expert physicians struggle to answer. The company's AI unit, led by the British tech pioneer Mustafa Suleyman, has developed a system that imitates a panel of expert physicians tackling "diagnostically complex and intellectually demanding" cases. Microsoft AI Diagnostic Orchestrator (MAI-DxO) correctly diagnosed up to 85% of NEJM case proceedings, a rate more than four times higher than a group of experienced physicians. MAI-DxO also gets to the correct diagnosis more cost-effectively than physicians, the company said in a blog post. The Microsoft AI Diagnostic Orchestrator, or MAI-DxO for short, was developed by the company's AI health unit, which was founded last year by Mustafa Suleyman. The tech giant said when paired with OpenAI's advanced o3 AI model, its approach "solved" more than eight of 10 case studies specially chosen for the diagnostic challenge. When those case studies were tried on practising physicians - who had no access to colleagues, textbooks or chatbots - the accuracy rate was two out of 10. Microsoft said it was also a cheaper option than using human doctors because it was more efficient at ordering tests. 
When benchmarked against real-world case records, the new medical AI tool "correctly diagnoses up to 85% of NEJM case proceedings, a rate more than four times higher than a group of experienced physicians" while being more cost-effective. What's impressive is that these cases are from the New England Journal of Medicine and are very complex, requiring multiple specialists and tests before doctors can reach any conclusion. According to Wired, the Microsoft team used 304 case studies sourced from the New England Journal of Medicine to devise a test called the Sequential Diagnosis Benchmark. A language model broke down each case into a step-by-step process that a doctor would perform in order to reach a diagnosis. For this, the company used different large language models from OpenAI, Meta, Anthropic, Google, xAI and DeepSeek. Microsoft said that the new AI medical tool correctly diagnosed 85.5 per cent of cases, far better than experienced human doctors, who correctly diagnosed only 20 per cent of the cases. "This orchestration mechanism -- multiple agents that work together in this chain-of-debate style -- that's what's going to drive us closer to medical superintelligence," Suleyman told Wired. Microsoft announced it is building a system designed to mimic the step-by-step approach of real-world clinicians -- asking targeted questions, ordering diagnostic tests, and narrowing down possibilities to reach an accurate diagnosis. For example, a patient presenting with a cough and fever might be guided through blood tests and a chest X-ray before the system determines a diagnosis like pneumonia. 
Microsoft said its approach was able to wield a "breadth and depth of expertise" that went beyond individual physicians because it could span multiple medical disciplines. It added: "Scaling this level of reasoning - and beyond - has the potential to reshape healthcare. AI could empower patients to self-manage routine aspects of care and equip clinicians with advanced decision support for complex cases." Microsoft acknowledged its work is not ready for clinical use. Further testing is needed on its "orchestrator" to assess its performance on more common symptoms, for instance.
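The sequential, cost-aware workflow these reports describe -- ask questions, order tests, narrow the differential, stop when confident or when the budget runs out -- can be sketched in a few lines. This is a toy illustration only: every name in it (`sequential_diagnosis`, `propose`, `run_test`, the test prices) is an assumption, since Microsoft has not published MAI-DxO's implementation.

```python
# Toy sketch of a sequential, cost-aware diagnosis loop.
# All names and costs are hypothetical; this is not MAI-DxO's actual code.

TEST_COSTS = {"blood panel": 50, "chest x-ray": 120}

def sequential_diagnosis(case, propose, run_test, budget=500):
    """Gather findings step by step until confident or out of budget."""
    findings, spent = dict(case), 0
    while True:
        diagnosis, confidence, next_test = propose(findings)
        if confidence >= 0.9 or next_test is None:
            return diagnosis, spent
        cost = TEST_COSTS.get(next_test, 0)
        if spent + cost > budget:          # cost-aware stopping rule
            return diagnosis, spent
        findings[next_test] = run_test(next_test)   # order the test
        spent += cost

# Stub reasoning: cough + fever -> order an x-ray; consolidation -> pneumonia.
def propose(findings):
    if "chest x-ray" not in findings:
        return "unknown", 0.4, "chest x-ray"
    if findings["chest x-ray"] == "consolidation":
        return "pneumonia", 0.95, None
    return "viral infection", 0.7, None

def run_test(name):
    return "consolidation" if name == "chest x-ray" else "normal"

diagnosis, spent = sequential_diagnosis(
    {"symptoms": ["cough", "fever"]}, propose, run_test)
print(diagnosis, spent)  # pneumonia 120
```

In the real system an LLM would play the role of `propose`, but the control flow -- interleaving reasoning with paid-for information gathering under a budget -- is the core idea the benchmark measures.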
[19]
Satya Nadella Says Microsoft's AI 'Orchestrator' Beats Human Doctors On Tough Diagnoses -- What This Means For The Future of Healthcare - Microsoft (NASDAQ:MSFT)
On Monday, Microsoft Corporation MSFT CEO Satya Nadella unveiled an AI system that outperforms human doctors in diagnosing complex medical cases, signaling a major leap toward "medical superintelligence." What Happened: Microsoft detailed two major advances, SDBench and MAI-DxO, in a blog post titled "The Path to Medical Superintelligence." SDBench: This tool uses 304 case studies from the New England Journal of Medicine (NEJM) and turns them into interactive diagnostic simulations. The AI must think critically, ask questions, order tests and weigh the costs -- much like a real-life doctor. MAI-DxO: This is a model-agnostic orchestrator that simulates a panel of virtual physicians. It achieved 85.5% diagnostic accuracy, which is four times higher than experienced human doctors, while also reducing diagnostic costs. "Excited to share two advances that bring us closer to real-world impact in healthcare AI," Nadella posted on X, formerly Twitter. "Together, these advances offer a blueprint for how AI can help deliver precision and efficiency in healthcare." Microsoft admitted that this technology isn't ready to be used in real medical settings yet. The "orchestrator" still needs more testing, especially to see how well it handles everyday symptoms and routine cases, the company said. Mustafa Suleyman, head of Microsoft AI, told The Guardian: "It's pretty clear that we are on a path to these systems getting almost error-free in the next 5-10 years. It will be a massive weight off the shoulders of all health systems around the world." Why It's Important: In the third week of June 2025, investors poured significant capital into AI startups and acquisitions, signaling strong confidence in artificial intelligence despite wider market uncertainty.
Major funding rounds included NYC's Tennr, which raised $101 million to advance AI-driven clinical intelligence, and RevelAi Health, which secured $3.1 million for predictive medical tools aimed at better patient outcomes and lower costs. Data analytics also saw big wins: PreciseDx landed $11 million to expand its AI-powered diagnostic imaging, while Typedef attracted $5.5 million to make complex analytics more accessible for businesses. Price Action: Microsoft (MSFT) shares edged up 0.30% during Monday's regular trading session but slipped slightly by 0.012% in after-hours trading, according to Benzinga Pro data. The stock has climbed 18.83% so far this year and is up 8.81% over the past 12 months. Benzinga's Edge Stock Rankings show that MSFT maintains a solid upward price trend in the short, medium, and long term. However, while its momentum remains strong, its value ranking is comparatively weaker.
[20]
Microsoft Says AI Tool Outperforms Physicians on Complex Medical Cases | PYMNTS.com
MAI-DxO also achieved correct diagnoses more cost-effectively than physicians, the company said in a Monday blog post. "For AI to make a difference, clinicians and patients alike must be able to trust its performance," the post said. "That's where our new benchmarks and AI orchestrator come in." Earlier benchmarks used to evaluate AI systems in medicine were based on the United States Medical Licensing Examination (USMLE), which relies on multiple-choice questions, favors memorization and therefore overstates the apparent competence of AI systems, according to the post. To overcome the limitations of that test, Microsoft AI developed a new one that requires sequential diagnosis and uses 304 recent cases published by NEJM, the post said. This test requires AI models and human physicians to ask questions, order tests and work toward a final diagnosis. Microsoft AI's test also includes a virtual cost that reflects real-world healthcare expenditures, per the post. MAI-DxO, as an orchestrator, accesses multiple language models and integrates diverse data sources, according to the post. It is also configurable, so it can be told to operate within defined cost constraints. "Together with our partners, we strongly believe that the future of healthcare will be shaped by augmenting human expertise and empathy with the power of machine intelligence," the post said. "We are excited to take the next steps in making that vision a reality."
The PYMNTS Intelligence and AI-ID collaboration "Generative AI Tracker®: Generative AI Can Elevate Health and Revolutionize Healthcare" found that while Americans are enthusiastic about the potential benefits of AI in healthcare, they still feel uncomfortable with the idea of healthcare providers relying on AI or replacing their medical professionals with this technology. Sixty percent of Americans said they are uncomfortable with a provider relying on AI in their healthcare, while 57% believe using AI to diagnose diseases and suggest treatments would harm the patient-provider relationship, according to the report.
[21]
New Microsoft AI System Beats Doctors in Complex Diagnosis
Microsoft's Artificial Intelligence (AI) team has shared research highlighting how AI can outperform human doctors on some of the "most complex diagnostic challenges" in medical science, per a blog post published by the tech giant. Microsoft has developed an AI system that emulates the actions of a panel of expert physicians dealing with "intellectually demanding" medical cases taken from the New England Journal of Medicine (NEJM). This AI system -- called the Microsoft AI Diagnostic Orchestrator (MAI-DxO) -- working in conjunction with OpenAI's o3 model, solved more than 85% of cases from NEJM. In contrast, doctors from the US and UK, each with 5-20 years of clinical experience, who had no access to AI chatbots, textbooks, or colleagues, could only achieve a success rate of 20%. For context, Microsoft created interactive case challenges involving stepwise diagnostic encounters sourced from NEJM, where AI models -- or human physicians -- could put forth questions and order tests. Microsoft: AI Will Assist Doctors, Not Usurp Them MAI-DxO logged "higher diagnostic accuracy" and lower overall testing expenditure than physicians or any individual foundation model. The research project tested AI foundation models including ChatGPT, Llama, Claude, Gemini, Grok, and DeepSeek. Notably, the tech giant claimed that AI would play a complementary role in the healthcare setting, rather than becoming the primary presence in a medical clinic. Further, it highlighted that a doctor's role is much more than simply making a medical diagnosis of patients. "While this technology (AI) is advancing rapidly, their (doctors') clinical roles are much broader than simply making a diagnosis," the blog post read. "Clinical roles will, we believe, evolve with AI giving clinicians the ability to automate routine tasks, identify diseases earlier, personalise treatment plans, and potentially prevent some diseases altogether," it added.
Microsoft also put forth its doubts about AI systems demonstrating their brilliance on medical examinations, such as the United States Medical Licensing Examination (USMLE). "In just three years, generative AI has advanced to the point of scoring near-perfect scores on the USMLE and similar exams. But these tests primarily rely on multiple-choice questions, which favour memorisation over deep understanding. "By reducing medicine to one-shot answers on multiple-choice questions, such benchmarks overstate the apparent competence of AI systems and obscure their limitations," the tech giant's blog post read. Why It Matters Even as AI makes strides in the healthcare sector -- illustrated by Microsoft's latest research -- doubts remain about its applicability in real-world settings. "Although MAI-DxO excels at tackling the most complex diagnostic challenges, further testing is needed to assess its performance on more common, everyday presentations," Microsoft conceded. Notably, the tech giant isn't the only company making forays into the health sector with AI - OpenAI recently released an open-source benchmark to evaluate AI's performance in health-related conversations. The Sam Altman-led company incorporated input from over 260 medical professionals across 60 countries, drawing on 5,000 real-world patient-doctor interactions. However, doubts remain about whether such AI models can generate responses to patients without any kind of hallucination creeping in. Elsewhere, Microsoft has reassured everyone that it will release MAI-DxO only after rigorous safety tests and regulatory review. "The work presented here is not yet approved for clinical use and would only be approved after rigorous safety testing, clinical validation, and regulatory reviews," the company stated.
"At the heart of any plans to deploy this technology in the real world is our commitment to safety, trust, and quality, ensuring that any healthcare solutions are clinically grounded, ethically designed, and transparently communicated," it added. This makes one thing amply clear: exciting times lie ahead at the intersection of AI and healthcare, but not before peer reviews and regulatory oversight deem the technology safe for use on patients in a real-world medical environment.
[22]
Microsoft's medical AI system is four times more accurate than human doctors: Here's how
Microsoft's multi-agent medical AI uses GPT, Claude, Gemini, and more to deliver accurate, fast diagnoses. On July 1, Doctors' Day, a day when India and the world celebrate the selfless service of healthcare professionals, Microsoft unveiled a breakthrough that could redefine the future of diagnostics. Its new system, the Microsoft AI Diagnostic Orchestrator (MAI‑DxO), has achieved an astonishing 85.5% diagnostic accuracy in complex medical cases, over four times more accurate than experienced human doctors working under controlled test conditions. While it's not intended to replace clinicians, MAI‑DxO could become a powerful second opinion in hospitals and clinics, especially where access to specialist expertise is limited. Microsoft's AI CEO Mustafa Suleyman says this is "a big step toward medical superintelligence." At its core, MAI‑DxO simulates a collaborative team of digital physicians, each responsible for different stages of the diagnostic process. When given a case like a patient with a high fever, shortness of breath, and fatigue, the system doesn't just guess the diagnosis. Instead, it follows a step-by-step process similar to how human doctors reason through complex symptoms. This entire process is designed to mimic the rigor of a seasoned team of physicians, but compressed into seconds. Microsoft tested MAI‑DxO using 304 of the most complex diagnostic puzzles from the New England Journal of Medicine, a gold standard in medical literature. These cases are far more nuanced than typical textbook problems. In a direct comparison, MAI‑DxO achieved an 85.5% diagnostic accuracy, while the average for 21 experienced human doctors from the U.S. and U.K. was just 20%. The system also reduced diagnostic costs by an average of 20%, thanks to its cost-aware decision-making.
The benchmark used is called SDBench (Sequential Diagnosis Benchmark), and it better replicates real-world workflows than traditional multiple-choice tests like the USMLE. It challenges both humans and machines to proceed step by step, with limited information and real-time decision-making. MAI‑DxO is model-agnostic, meaning it can work with any LLM that meets medical reasoning standards. It uses a five-agent framework, with each AI persona simulating a distinct function, like differential diagnosis, test selection, and final review. It's also designed to operate within configurable budget constraints, which could be game-changing in low-resource environments like rural clinics or public hospitals. While the system is still in the research phase, its potential applications are massive. Microsoft envisions MAI‑DxO assisting clinicians in hospitals, enhancing consumer health tools like Bing and Copilot, and integrating into documentation platforms like Dragon Copilot or radiology pipelines via RAD‑DINO. With healthcare waste in the U.S. alone estimated at over $1 trillion annually, MAI‑DxO could help reduce unnecessary testing and curb spiraling costs. Despite its promise, MAI‑DxO hasn't yet been tested in real-world clinical settings, where time pressure, incomplete records, and patient variability add layers of complexity. Critics also note that the doctors in the benchmark weren't allowed access to Google or clinical tools, potentially widening the gap. Moreover, questions around data bias, patient equity, and regulatory approval remain unanswered. Microsoft says clinical trials for broader ailments, not just rare or complex cases, are next, followed by safety testing and approvals. On a day meant to celebrate doctors, Microsoft's announcement is not a threat, but a tribute, highlighting how AI might one day become a tireless partner to clinicians everywhere.
If MAI‑DxO holds up in the real world, it could revolutionise diagnosis, democratise specialist-level care, and support the very doctors it aims to emulate.
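The "panel of virtual physicians" idea, in which several role-specialised agents are queried and their answers reconciled, can likewise be sketched. The persona names and the simple majority vote below are illustrative assumptions, not Microsoft's published design; `query_model` stands in for whatever underlying LLM call a model-agnostic orchestrator would make.

```python
# Illustrative sketch of a role-based "panel of virtual physicians".
# Persona names and the voting rule are assumptions for demonstration only.
from collections import Counter

PERSONAS = [
    "hypothesis generator",   # proposes differential diagnoses
    "test chooser",           # picks the next most informative test
    "challenger",             # argues against the leading hypothesis
    "cost auditor",           # flags spending beyond the budget
    "final reviewer",         # checks the answer before committing
]

def panel_diagnose(case, query_model):
    """Ask each persona for its diagnosis, then take the majority vote."""
    votes = [query_model(role=role, case=case) for role in PERSONAS]
    return Counter(votes).most_common(1)[0][0]

# A stub model that always answers the same way, just to show the flow.
answer = panel_diagnose({"symptoms": ["fever"]},
                        lambda role, case: "pneumonia")
print(answer)  # pneumonia
```

In a real deployment each persona would be a separately prompted model (or the same model under different instructions) and the reconciliation step would be a structured debate rather than a flat vote, but the orchestration pattern -- many specialised agents, one arbiter -- is what "chain-of-debate" refers to.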
Microsoft's new AI system, MAI-DxO, demonstrates superior diagnostic accuracy and cost-effectiveness compared to human doctors in complex medical cases, potentially revolutionizing healthcare diagnostics.
Microsoft has announced a significant breakthrough in medical artificial intelligence with its new AI Diagnostic Orchestrator (MAI-DxO), a system that has demonstrated remarkable capabilities in diagnosing complex medical cases 1. The tech giant claims this AI tool can diagnose patients four times more accurately and at a significantly lower cost compared to human physicians 2.
To evaluate the AI's performance, Microsoft developed the Sequential Diagnosis Benchmark (SDBench), using 304 case studies from the New England Journal of Medicine (NEJM) 1. This benchmark was designed to replicate the step-by-step diagnostic process used by human doctors, allowing for a more realistic assessment of AI capabilities in medical diagnosis 3.
In the experiment, MAI-DxO achieved an impressive 85% diagnostic accuracy on the NEJM cases, significantly outperforming human doctors who achieved only 20% accuracy 4. The AI system not only demonstrated superior diagnostic skills but also reduced costs by 20% through more efficient selection of tests and procedures 1.
MAI-DxO functions as a "virtual panel of physicians" by combining several leading AI models, including OpenAI's GPT, Google's Gemini, Anthropic's Claude, Meta's Llama, and xAI's Grok 1. This orchestration of multiple AI agents allows for a more comprehensive and nuanced approach to diagnosis, mimicking the collaborative efforts of human medical experts 2.
One of the most promising aspects of MAI-DxO is its potential to reduce healthcare costs. The system is designed to be cost-conscious, significantly reducing the number of tests required for accurate diagnosis 4. This feature could have far-reaching implications for healthcare systems struggling with high costs and resource allocation 5.
While the results are impressive, Microsoft emphasizes that MAI-DxO is still in the research phase and not yet ready for clinical implementation 4. However, the company sees potential for integrating this technology into consumer-facing products like Bing and Copilot, which already handle millions of health-related queries daily 2.
Despite its promise, the deployment of AI in healthcare raises important questions about data privacy, algorithmic bias, and the role of human judgment in medical decision-making 1. Microsoft acknowledges that while AI can surpass individual physician capabilities in some aspects of clinical reasoning, it is not intended to replace human doctors 3.
As AI continues to advance in the medical field, it presents both exciting opportunities and significant challenges. The development of tools like MAI-DxO suggests a future where AI could play a crucial role in supporting healthcare professionals, potentially improving patient outcomes and reducing costs across the healthcare system.