2 Sources
[1]
Generative AI-enabled clinical decision support system in primary care: a pragmatic, cluster-randomized trial - Nature Medicine
In this large-scale pragmatic randomized controlled trial of a generative LLM embedded in routine clinical workflows across the full spectrum of primary care, we found similar rates of 14-day treatment failure between groups, extending emerging evidence from recent randomized evaluations in other clinical settings. The estimated effect corresponded to between 13 fewer and 1 additional treatment failures per 1,000 patients, indicating that any true effect, if present, is likely to be modest. Interpretation of the primary outcome should consider both the magnitude and precision of the estimated effect. The observed event rate was lower than anticipated, resulting in limited precision for detecting modest effects. Nonetheless, the findings provide bounded inference: large clinically meaningful effects are unlikely, while smaller effects cannot be ruled out. This pattern probably reflects the complexity of clinical outcomes in primary care, which are influenced by a broader range of contextual factors than just the care provided in clinic, as well as the fact that the trial was powered for a larger effect size than the one observed. That said, there is strong evidence that the intervention improved the quality of clinical documentation and care (as demonstrated by improved diagnostic reasoning and appropriate treatment planning), reduced antibiotic-related costs and was generally safe. However, the trial was not powered to detect rare severe adverse events, and the absence of observed differences between groups should not be interpreted as evidence of safety equivalence. While no intervention-related safety concerns were identified, the data provide limited precision regarding rare harms, and uncertainty remains without a prespecified noninferiority or safety framework.
The intervention was implemented as a workflow-integrated decision support tool that generated recommendations automatically during routine documentation, without requiring clinicians to actively initiate its use. Clinicians retained autonomy to accept, modify or disregard the system's suggestions, and no additional incentives or enforcement mechanisms were introduced to promote uptake. As a result, the trial evaluates effectiveness under routine care conditions rather than efficacy under enforced use. While variable engagement with the system may have attenuated observable effects on clinical outcomes, this design reflects real-world implementation conditions and supports the external validity of the findings. While objective clinical endpoints such as hospitalization and death (that is, serious adverse events) are ultimately the most meaningful measures of impact, these events are rare in primary care. Post hoc power simulations based on the initial results reported indicated that detecting modest differences in rare clinical events would require substantially larger sample sizes (for example, >100,000 patients) than those feasible in the present study (Sam Waton & Jishnu Das, personal communication, 25 July 2025). Given the logistical and resource implications, and following discussion among the relevant stakeholders, a decision was made not to extend the trial. As this study exemplifies, there remains an open question around what is the most appropriate primary outcome for evaluating a general-purpose technology (for example, an LLM-based clinical decision support system (CDSS)) in a broad field (for example, primary healthcare), where metrics specific to any individual disease do not sufficiently capture the value proposition, and objective patient-level impacts require a scale of intervention that is often infeasible. The improvement in process outcomes observed aligns with findings from both controlled and real-world studies.
In a 40,000-encounter quality-improvement study conducted within the same primary care network as the present trial, an almost identical tool produced measurable gains in documentation completeness and adherence to clinical standards, which, in turn, mediated reductions in diagnostic and treatment errors. Similarly, findings from a pilot trial in Japan reported that provider-in-the-loop note generation using an LLM improved documentation quality across all domains compared to provider-only records, while maintaining accuracy and efficiency. Assuming previously described theories of change hold, such incremental process improvements may ultimately translate into meaningful gains for patient outcomes. The intervention did not change antibiotic prescribing rates among febrile patients. This may be explained by how deeply ingrained antibiotic prescribing for fever is in clinical practice, or the small size of this subgroup. That said, the observed (statistically significant) savings could be due to the increased power afforded by analyzing all trial participant data, and/or to the use of cheaper antibiotics (while maintaining a similar rate of appropriate prescriptions). This is noteworthy, as at the health-system level, many governments are working to advance universal health coverage amid tightening fiscal space and declining donor support. In the trial population alone, the direct per-patient savings from reduced antibiotic prescribing exceeded the per-patient cost of running the LLM, suggesting that such tools can generate net savings at the operational level, even before broader system or clinical benefits are considered; there is probably more nuance to consider when evaluating whether the technology is truly cost-saving once the total cost of ownership is calculated, but this is a positive signal nonetheless. 
The lower proportion of patients classified as at risk of type 2 diabetes in the intervention arm may reflect improved detection of individuals whom the LLM had already identified as having established diabetes. Thus, the intervention appears to have shifted a subset from the 'at-risk' category into known (pre-existing) diagnosis, a reclassification that may yield downstream efficiencies and cost savings by reducing unnecessary screening and assessments. Finally, patients did not report a difference in satisfaction between arms, despite a slightly longer consultation duration in the LLM arm. Evidence from a parallel mixed-methods evaluation within the same network provides insight into why this might be -- clinical officers viewed the 'AI Consult' as a complementary aid that improved diagnostic reasoning and thoroughness while preserving autonomy and rapport with patients. In combination with the user interface never being exposed to the patient, it is understandable why they did not perceive a difference in the consultation. Such findings reinforce that when well-integrated into clinical workflows, LLM-based decision-support systems can strengthen the quality of care without diminishing the human elements essential to patient trust and satisfaction. However, sustained reliance on automated reasoning could also erode providers' skills. More research is required to understand which deliberate design choices promote critical reasoning and potentially even upskilling, while avoiding cognitive offloading in a manner conducive to deskilling. The study had the following limitations: although the trial was randomized at the level of the clinical officer, it was conducted within shared facility environments, and informal exchange of clinical reasoning between clinicians could not be fully prevented. 
However, the LLM-assisted interface was accessible only to clinical officers assigned to the intervention arm through secure, role-based login credentials within the EMR, which helped limit cross-arm exposure. Nevertheless, any residual contamination would be expected to bias results toward the null, potentially attenuating observable differences between groups. Conducting the trial within a single private network of urban clinics in Nairobi may limit generalizability to rural, periurban or public-sector settings with different patient populations, staffing patterns and infrastructure. Because clinicians were randomized within shared facilities, informal exchange of clinical approaches between providers could not be fully excluded, which may have reduced contrast between groups and biased estimates toward the null. It is also possible that the effect of the LLM was attenuated by the relatively high baseline standards of care within the study network. Penda Health operates a structured quality-improvement framework with regular clinical audits, peer review and performance feedback through its EMR. These features probably narrowed the margin for measurable improvement compared to less digitized or lower-resourced environments, where the same intervention might yield greater benefit. As with all AI interventions, performance is tied to the specific model version and data distribution. Given the rapid pace of model evolution, newer versions are likely to show improved reasoning, safety filtering and bias control; our results should therefore be viewed as a temporal benchmark rather than a fixed estimate of capability. Finally, the 14-day follow-up period may have been too short to capture downstream effects such as reduced errors, improved continuity of care or operational efficiencies. In summary, the intervention did not reduce short-term treatment failure, and no safety concerns were identified. 
Larger or adequately powered studies are needed to determine whether modest clinical benefits exist with greater precision.
[2]
Clinical trial evaluates generative AI support tool in primary care
University of Birmingham, Jun 26 2026
A large real-world clinical trial has found that a generative AI-powered support tool used to support frontline clinicians was safe and improved the quality of clinical decision-making but did not significantly change short-term patient outcomes.
The study, published today in Nature Medicine, is one of the first randomized controlled trials worldwide to test whether generative AI can improve patient-level outcomes, rather than just clinician performance or simulated cases.
The trial involved more than 9,600 patients attending 16 primary care clinics in Kenya, and was delivered by experts at the University of Birmingham supported by the National Institute for Health and Care Research (NIHR) Biomedical Research Centre: Birmingham.
Clinicians were randomly assigned to use an electronic medical record system with or without an integrated AI consult tool that provided real-time diagnostic and treatment suggestions.
The AI system, known as 'AI Consult', was a large language model-based clinical decision support tool embedded directly within the existing electronic medical record system. During consultations, the tool worked in the background by:
* Analysing information entered by the clinician into the medical record
* Generating context-specific diagnostic and treatment suggestions, aligned with Kenyan national clinical guidelines
* Flagging potential concerns using a simple color-coded alert system (green, yellow or red)
Clinicians retained full autonomy; they were not required to follow the AI's advice, and retained responsibility for all diagnosis, prescribing and referral decisions. The AI interface was not visible to patients, helping preserve normal patient-clinician interaction.
"This is one of the first studies to rigorously ask the hardest question about AI in healthcare: whether it actually improves outcomes for patients. What we found is reassuring but also sobering.
The technology appears safe and clearly improves aspects of clinical decision-making, but translating those gains into measurable patient benefit is much more challenging, particularly in everyday primary care."
Professor Bilal Mateen, Senior Author, Honorary Professor of Machine Learning for Health, University of Birmingham, and Chief AI Officer at PATH
Serious outcomes such as hospitalisation or death are rare in primary care, meaning extremely large studies - potentially involving more than 100,000 patients - would be needed to detect modest effects.
Professor Alastair Denniston, co-author, Professor of Regulatory Science and Innovation at the University of Birmingham and lead for health data research at the NIHR Biomedical Research Centre: Birmingham, said: "A large part of primary care is to deal with common conditions, including those that are self-limiting, where many patients require low levels of healthcare intervention. In that context, even meaningful improvements in clinical reasoning may only result in small changes in patient outcomes that are very difficult to measure.
"What this study shows is that AI can be integrated safely into real clinical workflows, without undermining patient trust or clinician autonomy - which is a critical foundation for any future impact."
Findings: safety, quality and costs
Researchers found no statistically significant difference in treatment failure within 14 days between patients seen with AI-supported care and those receiving standard care (2.2% vs 2.0%). The study found no evidence of harm, with similar rates of hospitalisation and death in both groups. While the AI tool did not produce measurable improvements in short-term patient outcomes, it significantly improved the quality of clinical documentation and treatment planning, as assessed by an independent panel of experienced clinicians who were blinded to whether AI had been used.
Patient satisfaction was the same in both groups, suggesting that AI support did not alter patients' experience of care. The study also found that, although overall antibiotic prescribing rates were similar, antibiotic-related costs were lower in the AI-supported group, due to more cost-conscious prescribing choices.
Although the trial was conducted in Kenya, the researchers emphasize that the findings have global relevance, including for high-income health systems.
Professor Richard Riley, Professor of Biostatistics at the University of Birmingham and senior author, said: "Robust trials like this are so important to establish the real impact of using AI in practice. They help set realistic expectations of what AI can actually contribute within existing care pathways, and help guide where future investment and research effort should be focused. Generalisability of our findings to higher-income settings, where baseline standards of care are already high, needs to be evaluated."
The study was funded by the Gates Foundation, sponsored by PATH, and conducted with collaborators from the London School of Hygiene and Tropical Medicine and the KEMRI-Wellcome Trust Research Programme, Kenya.
Source: University of Birmingham
Journal reference: Agweyu, A., et al. (2026). Generative AI-enabled clinical decision support system in primary care: a pragmatic, cluster-randomized trial. Nature Medicine. DOI: 10.1038/s41591-026-04503-6. https://www.nature.com/articles/s41591-026-04503-6
A randomized controlled trial involving over 9,600 patients across 16 primary care clinics in Kenya tested whether generative AI can improve patient-level outcomes in real-world settings. The AI Consult tool improved clinical documentation and decision-making quality while reducing antibiotic costs, but showed no significant impact on short-term treatment failure rates. The findings raise important questions about measuring AI's value in primary care.
A groundbreaking randomized controlled trial published in Nature Medicine has evaluated whether generative AI can deliver measurable benefits to patients in primary care settings, moving beyond simulated cases to test real-world effectiveness [1][2]. The study involved more than 9,600 patients attending 16 primary care clinics in Kenya, making it one of the first large-scale trials to rigorously examine whether AI in healthcare actually improves patient outcomes rather than just clinician performance [2].
Source: News-Medical
Clinicians were randomly assigned to use an electronic medical record system either with or without AI Consult, an AI-powered clinical decision support system embedded directly into their workflow [2]. The generative AI support tool analyzed information entered during consultations, generated context-specific diagnostic and treatment suggestions aligned with Kenyan national clinical guidelines, and flagged potential concerns using a color-coded alert system [2]. Critically, clinicians retained complete autonomy to accept, modify, or disregard the system's recommendations, with the AI interface remaining invisible to patients [1][2].
The clinical trial found similar rates of 14-day treatment failure between groups, with 2.2% in the AI-supported care group versus 2.0% in standard care [2]. The estimated effect corresponded to between 13 fewer and 1 additional treatment failures per 1,000 patients, suggesting any true effect is likely modest [1]. The study found no evidence of harm, with similar rates of hospitalization and death in both groups [2], although the authors caution that the trial was not powered to detect rare severe adverse events [1].
Professor Bilal Mateen, Senior Author and Honorary Professor of Machine Learning for Health at the University of Birmingham, noted: "This is one of the first studies to rigorously ask the hardest question about AI in healthcare: whether it actually improves outcomes for patients. What we found is reassuring but also sobering. The technology appears safe and clearly improves aspects of clinical decision-making, but translating those gains into measurable patient benefit is much more challenging" [2].
While short-term patient outcomes did not differ significantly, the intervention significantly improved the quality of clinical documentation and treatment planning, as assessed by an independent panel of experienced clinicians who were blinded to whether AI had been used [2]. The trial demonstrated enhanced diagnostic reasoning and appropriate treatment planning, alongside reduced antibiotic-related costs due to more cost-conscious prescribing choices [1][2]. Notably, the intervention did not change antibiotic prescribing rates among febrile patients, possibly reflecting how deeply ingrained antibiotic prescribing for fever is in clinical practice [1].
Patient satisfaction was the same in both groups, indicating that the generative AI support tool did not alter patients' experience of care or undermine the patient-clinician relationship [2].
The AI-powered clinical decision support system was implemented as a workflow-integrated tool that generated recommendations automatically during routine documentation, without requiring clinicians to actively initiate its use [1]. This design reflects real-world implementation conditions rather than enforced use scenarios, supporting the external validity of the findings [1]. Professor Alastair Denniston, co-author and Professor of Regulatory Science and Innovation at the University of Birmingham, emphasized: "What this study shows is that AI can be integrated safely into real clinical workflows, without undermining patient trust or clinician autonomy - which is a critical foundation for any future impact" [2].
The trial highlights a fundamental challenge in evaluating general-purpose AI technologies in primary care: serious outcomes such as hospitalization or death are rare, meaning extremely large studies involving potentially more than 100,000 patients would be needed to detect modest effects [1][2]. The observed event rate was lower than anticipated, resulting in limited precision for detecting modest effects, though large clinically meaningful effects are unlikely based on the bounded inference [1].
The study was funded by the Gates Foundation, sponsored by PATH, and conducted with collaborators from the London School of Hygiene and Tropical Medicine and the KEMRI-Wellcome Trust Research Programme, Kenya; the researchers emphasize that the findings have global relevance beyond Kenya [2]. Professor Richard Riley, Professor of Biostatistics at the University of Birmingham, stated: "Robust trials like this are so important to establish the real impact of using AI in practice. They help set realistic expectations of what AI can actually contribute within existing care pathways" [2].
Summarized by Navi
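The point made above, that rare outcomes would need a trial of more than 100,000 patients, can be illustrated with a rough sample-size calculation. The sketch below is not the authors' power simulation. It uses the standard normal-approximation formula for comparing two proportions, and the function name `patients_per_arm`, the event rates and the `design_effect` parameter are all hypothetical choices made for illustration.

```python
from math import ceil
from statistics import NormalDist


def patients_per_arm(p_control, p_treated, alpha=0.05, power=0.80, design_effect=1.0):
    """Approximate patients needed per arm to compare two event rates.

    Normal-approximation formula for a two-sided test with equal allocation.
    design_effect > 1 inflates the result to allow for cluster randomization
    (an assumed value, not one reported by the trial).
    """
    if p_control == p_treated:
        raise ValueError("the two event rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    n = (z_alpha + z_power) ** 2 * variance / (p_control - p_treated) ** 2
    return ceil(n * design_effect)


# Hypothetical inputs: a 0.5% event rate reduced to 0.4% (not figures from the trial).
n_arm = patients_per_arm(0.005, 0.004)
print(f"per arm: {n_arm:,}  total: {2 * n_arm:,}")
```

With these illustrative inputs the formula gives roughly 70,000 patients per arm, about 140,000 in total, before any inflation for clustering. That is the same order of magnitude as the figure quoted by the researchers, although their own estimate came from simulations based on the trial data.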