6 Sources
[1]
Diagnostic accuracy, fairness and clinical implementation of AI for breast cancer screening: results of multicenter retrospective and prospective technical feasibility studies - Nature Cancer
The retrospective evaluation covered five breast screening services from across the UK, representing three distinct clinical workflows and including 125,000 women aged 50-70 who were screened in 2015-2016, as summarized in Table 1. The final analysis included 115,973 women after applying inclusion and exclusion criteria (Extended Data Fig. 1a and Supplementary Table 1). The AI system achieved superior sensitivity and noninferior specificity to the first reader, second reader and consensus decision after arbitration, at both the case and breast level (noninferiority margin: 5%, P < 0.001 for all; Fig. 2a,b). Across all services, the AI cancer detection rate (CDR) was higher than that of the first human reader (9.33 per 1,000 women, 95% confidence interval (CI): 8.78, 9.88 versus 7.54 per 1,000 women, 95% CI: 7.04, 8.03), although the AI recall rate was also higher (6.5%, 95% CI: 6.4, 6.7 versus 5.5%, 95% CI: 5.3, 5.6). Performance was sustained across all five services, despite varying cohorts and clinical screening practices (Fig. 2c). Full results are presented in Supplementary Tables 2 and 3.

The AI system demonstrated lesion-level sensitivity of 0.550 (95% CI: 0.512, 0.588) (Extended Data Fig. 2a). There was no comparator for human reads, as specialists do not routinely mark suspicious lesions on screening images in a digital form. For the closest comparison, case-level sensitivity at the same two sites was 0.61. To facilitate comparison with studies that do not consider interval cancers in their ground truth, the AI system achieved case-level sensitivity of 0.913 (95% CI: 0.895, 0.932) and specificity of 0.941 (95% CI: 0.940, 0.942), with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.978 (Extended Data Fig. 2b,c) when considering screen-detected cancers only.

The AI reader particularly outperformed on prevalent screens (women attending for the first time) compared with those who had been screened previously (termed 'incident' screens) (Fig. 2d). For prevalent screens, the AI system achieved the lowest recall rate (7.1%, 95% CI: 6.7, 7.5) versus the first human reader (11.8%, 95% CI: 11.3, 12.3) and the consensus read (8.5%, 95% CI: 8.0, 8.9), while also achieving the highest CDR (AI: 10.0 versus R1: 9.19 per 1,000; difference: 0.81, 95% CI: -0.03, 1.64). For incident screens, the AI achieved the highest CDR but also the highest recall rate. These results were largely consistent at the individual service level (Extended Data Fig. 3a-f).

The AI system correctly identified 25.0% (95% CI: 20.4%, 30.0%) of future interval cancer cases, with 88.0% of these localized to the correct breast and 58.1% localized to the precise lesion. For next-round cancers, identified only at the subsequent asymptomatic screening visit 3 years later, the AI system correctly identified 25.1% (95% CI: 22.1%, 28.1%) of cancer cases, again with 85.7% of cases correctly localized to the relevant breast and 53.1% localized to the precise lesion. We observed no notable differences in performance between the AI and the first human reader across the subgroups tested (Fig. 3), including age, index of multiple deprivation, ethnicity and breast density.
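The headline rates above are simple proportions with Wald-style confidence intervals. As a rough illustration of how such figures are derived, here is a minimal sketch; the counts are hypothetical (back-calculated to land near the excerpt's reported rates), not the study's actual tallies:

```python
import math

def wald_ci(k: int, n: int, z: float = 1.96):
    """Point estimate and approximate 95% Wald CI for a proportion k/n."""
    p = k / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

n = 115_973        # analyzed cohort size from the excerpt
detected = 1_082   # hypothetical count of AI-detected cancers
recalled = 7_538   # hypothetical count of AI recalls

cdr, lo, hi = wald_ci(detected, n)
print(f"CDR: {cdr * 1000:.2f} per 1,000 (95% CI: {lo * 1000:.2f}, {hi * 1000:.2f})")

rr, lo, hi = wald_ci(recalled, n)
print(f"Recall rate: {rr * 100:.1f}% (95% CI: {lo * 100:.1f}, {hi * 100:.1f})")
```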
Two subgroups were borderline for sensitivity and failed noninferiority at the prespecified 5% margin: index of multiple deprivation (IMD) decile 1 (AI versus R1 difference: +0.070, 95% CI: -0.103, 0.244; n = 2,192) and mixed ethnicity (AI versus R1 difference: -0.048, 95% CI: -0.160, 0.000; n = 1,132), although both groups contained few positive cases, limiting the strength of the statistical conclusions possible. AI specificity was within the 5% noninferiority margin for all groups, with the exception of women attending screening for the first time and the 50-54 age group, for whom AI specificity was significantly higher. AI generally exceeded first-human-reader sensitivity, particularly for women over 65 years of age.

The distribution of disease detected with AI tended to favor higher-risk over lower-risk cancers. Compared with the first reader, the AI system achieved higher sensitivity for higher-risk cancers (0.55 versus 0.44; difference: 0.109, 95% CI: 0.083, 0.135; superiority P < 0.001) and noninferior sensitivity for lower-risk cancers (0.53 versus 0.47; difference: 0.052, 95% CI: -0.021, 0.125; noninferiority P = 0.003). For invasive cancers alone, the AI system achieved superior sensitivity compared with the first, second and consensus decisions (0.54 versus 0.43, 0.46 and 0.46, respectively; P < 0.001 for all). When considering maximum lesion size per case, AI sensitivity compared favorably with human readers across the range and especially outperformed for 20-30-mm lesions (Extended Data Fig. 4). Because of the low prevalence of cancer, CIs for many subgroups were wide despite the large size of the study.

Performance was essentially consistent across the Hologic, GE and Siemens devices included in the study. However, within Hologic, cases imaged using the newer Hologic Selenia Dimensions (n = 4,692) demonstrated a distribution shift compared with the older Hologic Lorad Selenia (n = 77,840), resulting in a higher recall rate of 10.9% (95% CI: 10.0, 11.8) versus 6.3% (95% CI: 6.1, 6.4). Full results are presented in Supplementary Table 2. We assessed model calibration across different subgroups of interest (Extended Data Fig. 5a-f). Overall, there were no concerning disparities between subgroups within the range of the operating points (OPs) selected for this study. We noted that the Asian ethnicity and age 50-59 subgroups were somewhat overcalled compared with the others, although this was most evident at higher model OPs, outside the range set for each site.

We considered the clinical and operational effect of replacing one of the two historical readers before arbitration. Using AI as a second reader resulted in a 32.1% reduction in total reader time required (195,983 versus 288,616 equivalent reads), while CDR increased by 17.7% (from 8.7 to 10.2 per 1,000) or 20.2% (from 8.5 to 10.2 per 1,000) for sites that arbitrate all recalls or arbitrate only discordances, respectively. Overall, including cases that the AI was unable to process, which retain their traditional double-read workflow, the number of total human screening reads performed before arbitration was reduced by 46.4% (133,943 versus 249,916 reads), while the arbitration reads required increased by 60.3% (12,408 versus 7,740 reads). Some services use radiographers to perform screening reads but not arbitration reads; thus, estimates of overall workforce cost will vary on the basis of local practice. An example cost sensitivity analysis across the range of model OPs is presented in Extended Data Fig. 6a.
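The noninferiority claims in this excerpt follow a standard recipe: test whether the AI-minus-reader difference in sensitivity lies above a -5% margin. Below is a simplified sketch using an unpaired Wald variance; the study itself used Obuchowski's cluster-adjusted variance for paired reads, which this does not reproduce, and the inputs are hypothetical:

```python
import math
from statistics import NormalDist

def noninferiority_p(p_ai: float, p_r1: float, n_pos: int,
                     margin: float = 0.05) -> float:
    """One-sided Wald test of H0: sens_AI - sens_R1 <= -margin.
    Simplification: treats the arms as independent samples of size n_pos,
    whereas the study used paired reads with Obuchowski's variance."""
    se = math.sqrt(p_ai * (1 - p_ai) / n_pos + p_r1 * (1 - p_r1) / n_pos)
    z = ((p_ai - p_r1) + margin) / se
    return 1 - NormalDist().cdf(z)  # small p-value -> noninferiority supported

# Hypothetical subgroups: few positive cases widen the CI and can fail the margin.
print(noninferiority_p(0.55, 0.44, n_pos=500))  # comfortably noninferior
print(noninferiority_p(0.50, 0.55, n_pos=40))   # small n: inconclusive
```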
The complementary nature of human and AI reading is highlighted in the detection patterns, with substantial but incomplete overlap between the cancers identified by each approach (Extended Data Fig. 6b,c). Of the 40 cancers detected by human double reads but missed by the human + AI approach, 35 (88%) were deemed high risk. Of the 231 cases detected by the human + AI approach but missed by human double reads, 215 (93%) were deemed high risk. This suggests that the distribution of detected cancers is subtly shifted toward higher-risk tumor types when incorporating AI into the reading workflow.

Two screening services were included in the prospective deployment, covering 12 screening sites across London. Characteristics of the dataset are summarized in Table 2, with a data flow diagram in Extended Data Fig. 1b. In total, 43 women opted out of participating in the study. While the study was not powered for significance, AI and human reader performance is shown in Fig. 4a,b. We implemented an iterative OP calibration process, setting initial thresholds on the basis of available historical data, followed by monitoring of recall rates, as described fully in the Methods. After approximately 2 weeks, we reviewed the initial metrics available, primarily the recall rate. Service 1 had an AI recall rate of 11.3% (human first reader: 3.8%), while service 2 had an AI recall rate of 12.3% (human first reader: 5.3%). Both were above our target recall rates; thus, we adjusted the OP using the prospectively collected data. In the second period of the study, the AI recall rate was 6.7% for service 1 (human first reader: 4.7%) and 10.0% for service 2 (human first reader: 4.7%). There was substantial week-to-week variation in the cohort, as shown in Fig. 4c, highlighting the challenges of detecting drift in this type of low-prevalence screening population. Across both sites, the time from screen to completed AI read was 17.7 min (interquartile range (IQR): 8.5-37.6 min), while the time from screen to first human read was 2.08 days (IQR: 1.00-3.81 days).

Relative accuracy was maintained in the prospective deployment compared with the retrospective study (breast-level AUC: 0.98 for both; Extended Data Fig. 7a-f and Fig. 4a, top row versus second and third rows) when reanalyzed using a comparable ground-truth definition. When deployed at the screening sites, the initial OP was set to an overly sensitive threshold, resulting in a higher recall rate and an associated lower specificity. This highlighted a distribution shift between the original 2016 data used to set the OP and the newer 2023 deployment period, with both human and AI performance affected (Fig. 4b). Marked variations in the cohort can be seen in the week-to-week plots of AI and human performance (Fig. 4c), highlighting the challenges of closely monitoring AI accuracy and safety after deployment.

The National Health Service (NHS) Breast Screening Programme (BSP) mandates a double-read workflow using human readers. In our AI-enabled workflow, an AI system replaces the second reader. However, as the AI cannot process all cases (for example, cases that do not meet the intended-use criteria or that fail technically), the ability to invoke a human second reader remains necessary. At the time of the study, the National Breast Screening System (NBSS) software lacked functionality to automatically write back AI results and required an 'assessment reason' for recalls, which the AI does not currently provide.
Therefore, to ensure that the BSP is AI ready from a technical perspective, changes will be required to both national program guidelines and the IT systems. Despite elements of national standardization, the BSP allows for site-specific variability in other aspects of the workflow (Supplementary Table 4), including some flexibility in reading practices, devices and staffing allocations. This allows local services to adapt workflows to their unique circumstances, such as local reading performance goals and staffing levels, which also needs to be considered for AI workflows. An important factor in the feasibility of AI adoption is workflow digitization and standardization. At the time of the research, eight of the nine services interviewed relied on paper to drive the workflow (Supplementary Table 4). Readers relied on client worksheets and physical processes to indicate which reading step was needed. Despite introducing redundancy and complexity, paper documentation was viewed as an important failsafe for checking that the correct results were sent. AI cannot interact with physical documents; thus, full digitization would simplify an AI-enabled workflow (Extended Data Fig. 8). Furthermore, full standardization of workflow data collection (for example, including DICOM tags) would also facilitate AI integration. However, achieving this level of digitization will require extensive investment and effort. Further details about the workflow design process are available in Supplementary Note 1.
[2]
Prospective evaluation of artificial intelligence integration into breast cancer screening in multiple workflow settings: the GEMINI study - Nature Cancer
For healthcare settings wanting to reduce workload, using AI for triage enables single human reading for the subset of cases where AI can serve as the second reader. Compared with routine screening, one modeled 'triage' workflow (OP1) would deliver the highest workload savings, 44%, with a 5.8% reduction in the recall rate (Table 4). PPV increased by 5.1%, with one cancer missed compared with routine double reading (Table 4). A 'triage negatives' workflow (OP3) would reduce workload by 37% and the recall rate by 12.2% while identifying all cancers detected by routine screening. The PPV would increase by 13.8%, yielding the largest reduction in false positives (Table 4). A toy model of this triage arithmetic is sketched below.

The methodological approach used in the GEMINI study demonstrates multiple AI implementation strategies that could improve breast screening programs. The primary AI workflow was superior to routine double reading for cancer detection (1 per 1,000, a 10.4% increase) without increasing the number of women recalled for further investigation. Workload savings of up to 31% could be achieved. The other workflows assessed show that AI could be adjusted to fit the different needs of breast screening programs and sites. Sites could select a workflow based on their priorities, such as detecting more cancers, reducing recalls or saving on workload, to find the option that best aligns with their operational goals and capacity.

Similar to previous prospective studies, our study showed that AI can increase cancer detection while decreasing workload, demonstrating the generalizability of these findings to a UK screening population. The primary AI workflow demonstrated a similar additional CDR (1 per 1,000) to the Swedish MASAI trial, which used AI to triage screening examinations; in the MASAI trial, workload was reduced by 44%. In Denmark, where AI is routinely used in one region, workload has been reduced by 33.5% and recall rates by 20.5%. In the GEMINI study, the primary workflow could enable up to 31% workload savings without increasing the recall rate. Higher workload savings could be achieved using other implementation strategies, which also reduce recall rates without affecting the number of cancers detected through routine screening. Despite different AI systems, clinical settings, screening pathways and study methods, which limit direct comparisons, all of these studies show beneficial results from using AI in breast screening. The GEMINI study adds to the literature by enabling multiple AI strategies to be evaluated, providing a more complete picture of the available implementation options.

Compared with a previous retrospective study conducted at the same site, the routine CDR was higher during the GEMINI study (9.7 per 1,000 versus 8.0 per 1,000), while the recall rate was lower (4.5% versus 5.0%). This could be explained by variation in baseline cancer incidence. Another cause may be delays in breast screening due to the coronavirus disease 2019 pandemic, leading to longer screening intervals (>3 years). The human experts' reading practice may also have been influenced by their awareness of, and participation in, this study, which may have increased the center's routine double-reading performance. This prospective service evaluation of AI used live in the UK was limited to one AI system within a single UK region.
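As promised above, here is a toy model of the triage workload arithmetic. Everything in it is an assumption for illustration: synthetic AI scores, a made-up threshold, and the simplification that AI-negative cases receive one human read while AI-flagged cases keep a full double read (the study's actual workflows are more nuanced):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
ai_score = rng.beta(1, 20, size=n)   # synthetic AI suspicion scores

threshold = 0.1                       # hypothetical triage operating point
needs_double_read = ai_score >= threshold

# Routine double reading: every case gets two human reads.
routine_reads = 2 * n
# Triage: one human read per case, plus a second read only where AI flags.
triage_reads = n + int(needs_double_read.sum())

saving = 1 - triage_reads / routine_reads
print(f"Flagged for double read: {needs_double_read.mean():.1%}")
print(f"Human-read workload saving: {saving:.1%}")
# With ~12% of cases flagged, the saving under these synthetic assumptions
# happens to land near the ~44% modeled for OP1 in the excerpt.
```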
The study controlled for radiologist-AI interaction by releasing the AI opinion to the additional arbitration readers only after the final routine double-reading decision, allowing a true reflection of the contribution of AI to increased cancer detection. A 3-year follow-up of the women in the study for interval cancers is not yet available, meaning that the actual sensitivity of the human readers and the AI system could not be assessed. We plan to monitor interval cancer rates in this cohort using the standard audit pathway in our participating screening center. Changes to the mammography imaging machines (hardware and software) were paused during the study because prior work indicated that such changes may affect AI performance. The triage workflows were simulated, meaning that human behavioral changes in response to AI use in the screening workflow could not be fully assessed. In addition, simulating the workflows implicitly assumes that changes in workload do not affect reader performance. Future studies using the optimal workflows investigated in the GEMINI study would allow further assessment of AI use in clinical practice.

The AI excluded a relatively high proportion of mammography examinations (10.8%) that fell outside its intended use, compared with previous studies using different AI tools. The reported performance improvements do not apply to women ineligible for AI reading; therefore, they would appear smaller when averaged across the entire screening population, that is, all women attending screening. The number of cases the AI recalled in the GEMINI study at OP2 was higher than in the prior retrospective study (14.6% versus 13.0%). This could be due to natural variability in recall rates or to an interim change in the mammography machine software version before the start of the study. The AI vendor recommended recalibration before the commencement of the study, but this was not feasible because of time and governance constraints. Therefore, a pragmatic approach was taken, using thresholds previously derived from retrospective data from the same site. To ensure that AI performance remains consistent after changes to the imaging systems, AI monitoring and quality assurance methods should be considered, which could include realistic breast phantoms. Future studies would be strengthened by evaluating several AI algorithms across multiple sites, exploring AI-human interactions according to reader characteristics such as experience, and assessing real-time adaptive AI thresholds. In addition, studies with longer durations and follow-up would be able to assess the stability of workflow performance over time and whether detecting additional cancers with AI reduces interval cancer rates.

The live AI-Additional Read workflow was run at a higher-sensitivity OP (OP2), which also enabled simulation at a higher-specificity OP (OP1). At OP2, this workflow flagged a relatively high proportion of cases (12.4%, n = 1,345) for additional arbitration. Approximately 90% of these were read faster than, or at the same speed as, a routine mammography examination (an average read opinion takes 59 s), demonstrating that using AI in this way does not increase the overall reading time. Of the 1,345 cases flagged, only 55 were recalled and 11 cancers were diagnosed, giving a PPV of 20% for the AI-supported additional human review, similar to routine screening at this center (21.9%).
The human readers dismissed over half of the additional arbitrations after reviewing prior mammogram images; the read time for the dismissed cases was shorter than for those recalled. This suggests that the senior readers performing the additional arbitration were willing and able to critically assess, and quickly override, the AI opinion, thereby maintaining the expected standards for screening. The simulated AI-Additional Read workflow at the higher-specificity OP would increase workload less than OP2 (4% versus 6%; n = 896 versus 1,345), recall 18% fewer cases (n = 45 versus 55) and flag 10 (91%) of the 11 additional cancers detected. Incorporating the ability to consider prior mammograms may improve the functionality of this AI breast screening system and reduce human workload further.

This evaluation of AI for breast screening has identified several ways in which AI could be used in a screening program to optimize workload savings, CDR and the reduction of false positives while improving, or not compromising, other outcomes. These AI implementation strategies could deliver clinical and operational benefits with trade-offs that can accommodate local requirements and priorities. This is particularly important given the shortfall of radiology consultants and increasing workloads. The technical challenges of integrating AI into clinical practice and of setting an appropriate threshold highlight the need for evaluation of AI before implementation, as well as ongoing monitoring. The GEMINI study shows that AI use can be tailored to the needs of clinical sites to improve service delivery.
[3]
Impact of using artificial intelligence as a second reader in breast screening including arbitration - Nature Cancer
In conclusion, this study explored the effect of introducing AI as a second reader in a double-reader workflow, crucially including the process of specialist arbitration. It showed that AI-enabled reading was noninferior to a standard two-reader workflow and that, when a second reader was replaced with AI, the overall reading workload was reduced. Further development of the AI tool, alongside improvements in explainability and acceptance of the tool by mammography readers, could lead to the detection of cancers earlier than with two human readers.

This study is part of the Artificial Intelligence in Mammography Screening (AIMS) study. The AIMS study protocol was approved by the East Midlands Nottingham Research Ethics Committee (no. 22/EM/0038) and the NHS England Breast Screening Programme Research Advisory Committee (no. BSPRAC_0093). The study was registered with the ISRCTN (no. 60839016). The AIMS study was funded by a National Institute for Health and Care Research (NIHR) award from the Secretary of State for Health and Social Care. An overview of the study is given in Fig. 1a-e. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Mammography images and clinical data for 50,000 women from two NHSBSP screening centers were selected from the OPTIMAM Mammography Image Database (OMI-DB). The AI developer had used 10,000 cases from each screening center to select the operating point, or threshold, that is optimal for the recall rate at the local center. The vendor advised that they would do this before clinical implementation at any NHSBSP screening center, so this mimicked the clinical situation. Therefore, all episodes for these women from any year were removed before selection of the study dataset. We randomly selected 25,000 women aged 50-70 years per screening center from 2016. This permitted a 3-year follow-up in 2019, avoiding any potential impact of COVID-19 on the data.

At both screening centers, the numbers of interval cancers in the National Breast Screening System (NBSS) were lower than expected from the national Screening History Information Management system. Therefore, the additional interval cancers needed to reach the numbers reported by the Screening History Information Management system were selected from a wider year range: 2011-2018. There was a proportion of women with normal mammograms but no subsequent normal mammogram to confirm the negative ground truth. To ensure a high-quality ground truth, women younger than 68 years with normal mammograms but no subsequent screening episode were replaced with women who did have a follow-up mammogram, drawn from a wider year range: 2011-2018. The women were matched by episode outcome, by whether it was the first or a subsequent mammogram, and by age (within ±1 year for screening center 1 and within ±3 years for screening center 2). Women aged 68+ years were permitted to have no follow-up screen, because they would not typically be invited back as part of national screening at this age. For screening center 2, there was also a proportion of women whose cases had been used to train the AI tool. These were likewise replaced with women whose cases had not been used to train the AI tool, using the same matching criteria as above. The clinical data included pathological information and the recall or no-recall decisions of the historical first and second readers and arbitration.
The locations of cancers for 94.5% of positive cases were recorded with a rectangle around the cancer or, for those detected as interval cancers or at the next screening round, around the area where the cancer was later detected. These bounding boxes were the ground-truth ROIs. For the remaining 5.5% of positive cases, there were insufficient clinical data available for such annotation.

AI exclusion criteria were applied for technical recalls, any study containing more or fewer than four images, or cases with implants, resulting in 4,354 women (8.7%) being excluded. The AI tool was run on the mammography images of the women not excluded, and an AI recall decision for each woman was obtained using the site-specific operating points. This included a case-level decision and ROIs (bounding boxes marking suspicious areas). In addition, 44 (0.1%) cases were excluded owing to insufficient or conflicting clinical information.

In the human arm of the study, the workflow was based on the recall decisions of the historical first and second human readers. In the AI arm, the workflow was based on the recall decision of the first historical human reader and the AI tool. To determine the impact of the AI tool on arbitration, the arbitration criteria at each center were applied and the selected cases were read in a reader study, with two readers making the arbitration decision. At center 1, women went to arbitration if, for either breast, there was a disagreement between the first and second readers. At center 2, women went to arbitration if recalled by the first reader, the second reader or both. A flowchart of the case selection, study exclusions and allocation to arms is given in Fig. 1e.

The AI system used in this evaluation was created by Google (v1.2, Google LLC) and is an updated version of the v1.0 model. This is an AI-powered, independent mammography reader product for double-read breast cancer-screening workflows. It analyses two-dimensional, full-field digital mammograms to give a normal or abnormal screening determination and highlights suspicious ROIs. The AI system has three components: (1) a global model, which takes the four mammograms and produces a case-level prediction; (2) a detection model, which detects bounding boxes of lesions in each view; and (3) a hybrid model, which takes as input the features from the last layer of the global model and the bounding boxes from the detection model to produce a score for each bounding box. The final case-level cancer prediction is the maximum score of the bounding boxes for that case. The AI system outputs DICOM images showing the bounding boxes with scores above the operating point; however, case-level and bounding-box scores are not displayed to the user.

Data from 76,142 women (63,918 from the UK, 12,224 from the USA) were used to train the AI tool. Among all the studies, 88.7% were Hologic images, 9.6% GE images, 0.9% Siemens images and 0.8% Philips images. The exclusion criteria of the AI tool comprise technical recalls, cases containing more or fewer than four images, and implants. The four-image requirement is due to the design of the AI tool: it processes one image for each of the four mammogram views (left craniocaudal, left mediolateral oblique, right craniocaudal and right mediolateral oblique) for a complete analysis (no missing view allowed) and, when multiple images of the same view are present, it defers the selection of that image to the operator.
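The three-component architecture described above ends in a simple aggregation rule: the case-level prediction is the maximum of the hybrid model's bounding-box scores. Here is a minimal sketch of that final step only; the field names and the zero-score default for box-free cases are illustrative assumptions, not the product's API:

```python
from dataclasses import dataclass

@dataclass
class ScoredBox:
    view: str      # one of "LCC", "LMLO", "RCC", "RMLO"
    x: float       # illustrative box geometry, normalized to [0, 1]
    y: float
    width: float
    height: float
    score: float   # hybrid-model score for this suspicious region

def case_level_prediction(boxes: list[ScoredBox]) -> float:
    """Final case-level cancer prediction: the maximum bounding-box score.
    A case with no scored boxes is treated here as 0.0 (an assumption)."""
    return max((b.score for b in boxes), default=0.0)

def recall_decision(boxes: list[ScoredBox], operating_point: float) -> bool:
    """The case is flagged for recall when its prediction clears the site OP."""
    return case_level_prediction(boxes) >= operating_point

boxes = [ScoredBox("LCC", 0.41, 0.33, 0.08, 0.06, score=0.62),
         ScoredBox("LMLO", 0.45, 0.52, 0.07, 0.07, score=0.71)]
print(case_level_prediction(boxes), recall_decision(boxes, operating_point=0.68))
```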
Nine radiologists from center 1 and nine radiologists and four consultant radiographers from center 2 participated in the reader study. All were NHSBSP-accredited mammography readers, with between 3 and 36 years of experience (mean = 13.5 years) and reading between 2,300 and 15,000 examinations per year (mean = 6,000). Only one reader had previous experience of using AI. All readers were provided with an information pack and completed a consent form before the study.

All readers completed training in interpreting the AI tool. This was provided by the AI vendor to mimic what would happen clinically and included a 10-min video explaining the AI tool and 100 cases showing the AI decision, the ground truth (cancer or no cancer) and the location of any cancers. These training cases were from a screening center not included in the study. In addition, a pilot study of 28 cases was performed by the research team to train the readers in how to use the viewing software (RiViewer) and to test the entire process before the main study, including paperwork generation, the hanging protocol, the clarity of the questions asked and timing. There was no overlap between the pilot study cases and the cases used in the main study.

Clinically, when making arbitration decisions, the arbitration panel can view the decisions of the first and second readers on the NBSS and the clinical paperwork, where the readers have written their opinion and/or drawn areas of suspicion on a diagram. It was not possible to show the readers in this study the original paperwork or the NBSS, because this would reveal the original arbitration decision and the data would not be anonymized. To overcome this, research radiographers laboriously transcribed the original first and second reader opinions and diagrams of suspicious regions to create an anonymized copy of the study paperwork, blinded to the screening outcome. For cases in the human arm, the study paperwork contained the opinions of both the first and second readers. For cases in the AI arm, the study paperwork contained only the first human reader's opinion (the second reader being the AI).

Batches of ten cases for arbitration were reviewed by pairs of readers. As these sessions were outside working hours, the pairing depended primarily on working patterns. This mimics the clinical situation, where readers working at the same time arbitrate together. The pairs were not fixed, to allow flexibility around clinical and personal commitments. The pair reading each batch was recorded. The reading took place on clinical workstations at the screening center using RiViewer software, in a reading room with normal clinical conditions, including low lighting and high-resolution monitors. The proportion of AI arm and human arm cases within a batch was based on the proportion in the entire dataset at that center. The readers saw the study images (termed 'current images'), the images of the immediately prior screening round if there was one and, for the AI arm, the images produced by the AI model with any areas of concern annotated. For both arms, the paperwork was shown after the readers had looked at the current and prior images. For the AI arm, the study paperwork was shown at the same time as the AI images, so that the readers saw the human and AI decisions simultaneously. For both arms, the readers had to complete a whole loop of a defined hanging protocol before they could make a decision for that case.
It was not possible to blind the readers to the AI arm because the AI output was overlaid on the images and the human readers' decisions appeared on the paperwork. However, this is clinically realistic because it is how the images would be read in practice. For all cases, readers were asked to provide the Royal College of Radiologists 5-point M-score (M1, no recall; M2, no recall; M3, recall; M4, recall; M5, recall) for each breast, and the Breast Imaging Reporting and Data System (BIRADS) breast density category (A-D). For cases with prior imaging, they were additionally asked whether the priors changed their recall decision. For cases in the AI arm, they were also asked whether they were satisfied with the AI assessment of the case. If recalling a case, the reading pairs were asked to draw a bounding box around each area being recalled and to provide the conspicuity, lesion type and suspicion of malignancy. The readers were asked to draw a rectangle around the region in both views. Each region has an ID and, if they saw it in both views, they linked the regions with the same ID.

Collection of all clinical data and images was automated, and the images and data fields were not altered during collection. This ensured that the data were clinically relevant and representative. The reader study required study paperwork to be transcribed from clinical paperwork by research radiographers. The trial manager checked that the clinical paperwork had been correctly transcribed for 1% of the study paperwork during on-site monitoring. The data entered by the readers in the reader study were checked fortnightly with automated scripts for any inconsistencies or incomplete data. These data checks were outlined in a data management plan at the start of the study. Participants completed online surveys before, during and after the study. These included relevant questions from the NASA Task Load Index, trust and general impressions of the AI tool. Results in this paper are shown for the surveys after the study.

A positive case is a woman diagnosed with cancer within 39 months of the screening mammogram used in the study, based on pathological information. This therefore includes screen-detected cancers, interval cancers and screen-detected cancers found at the next screening round. A negative case is a woman whose mammograms used in the study resulted in an outcome of normal with routine recall to screening 3 years later, and whose follow-up mammograms from 24 months onward also resulted in an outcome of normal with routine recall to screening 3 years later (age <68 years only).

The mammograms of all the positive cases were annotated by expert radiologists or consultant radiographers who did not participate in the study. They drew a rectangular ROI tightly around each lesion. They then described the radiological appearance of the lesion (mass, distortion, asymmetry, calcification), whether the lesion was malignant or benign and the conspicuity of the lesion on a three-point scale (very subtle, subtle or obvious). Conspicuity was defined as how visible the lesion was in the image, in the annotator's judgment. For interval cancers and next-round cancers, the cancer was annotated on the diagnostic image (where available) and, in addition, at the location where the cancer would have been, as annotated on the prior image. Descriptive analysis was used to summarize the study population characteristics. Frequencies and percentages were calculated for categorical data.
A χ² test was used to compare proportions of characteristics between the included and excluded groups. Our primary endpoint was noninferiority (prespecified 5% absolute margin) of the AI arm for sensitivity and specificity at the case level, compared with the human arm, measured against a 39-month ground truth. Statistical testing was performed using one-sided tests at the 0.025 significance level (after correcting for multiple comparisons using Holm-Bonferroni). CIs on the difference were Wald intervals, and Wald's test was used for noninferiority; both used Obuchowski's variance estimate. If noninferiority was shown, a one-tailed superiority test was planned to follow without loss of power or a requirement for multiple testing. Superiority comparisons were conducted using Obuchowski's extension of the two-sided McNemar's test for clustered data. Clusters were defined to group arbitrations read by the same reader pair. For case-level analysis, the highest RCR M-score for each breast was used. The data met the requirements of the paired binary tests used (Wald's and McNemar's tests).

Case-level secondary analysis included positive predictive value (PPV), negative predictive value (NPV), cancer detection rate (CDR) and recall rate (RR). For PPV and NPV, CIs on the absolute values, the differences and the CIs on the differences were calculated by bootstrapping. For CDR and RR, differences were calculated using Obuchowski's extension of the two-sided McNemar's test for clustered data, and Wald CIs were calculated with Obuchowski's clusters based on reader pairs.

Case-level subgroup sensitivity and specificity were calculated by type of screen, age, ethnicity, X-ray system manufacturer, IMD and breast density. In addition, subgroup sensitivity was calculated by cancer type, cancer grade, lesion characteristic and lesion size. Age was taken from the NBSS; the age groupings (50-54, 55-59, 60-64 and 65-70 years) were as reported in published NHSBSP statistics. Ethnicity was taken from the NBSS; the ethnicity groupings (white, mixed, Asian, black, other, not specified) were based on the NHS Data Dictionary ethnic categories (https://www.datadictionary.nhs.uk/data_elements/ethnic_category.html). The IMD deciles 1-10 (as defined at https://www.gov.uk/government/statistics/english-indices-of-deprivation-2019) were calculated from lower layer super output area data before de-identification. Breast density values (BIRADS 1-4) were calculated for mammograms acquired using Hologic devices, with software developed by Royal Surrey; the breast density subgroups were the categories from BIRADS, 5th edn. X-ray manufacturer values (Hologic and Siemens) were taken from the DICOM headers of the mammography images. The screen type (first or subsequent screen) was taken from the NBSS, with subgroups as in NHSBSP statistics. The cancer type (invasive or in situ) was taken from the NBSS; these subgroups are reported in published NHSBSP statistics. The invasive grades (1, 2 and 3) and in situ grades (low, intermediate and high) were taken from the NBSS; the subgroups were based on the NHS Data Dictionary tumor grades for breast screening (https://archive.datadictionary.nhs.uk/DD%20Release%20June%202023/attributes/tumour_grade_for_breast_screening.html). The lesion type was obtained by an expert radiologist annotating the cancers; where that was not possible because the diagnostic images were unavailable, the lesion type was taken from the NBSS.
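For the paired comparisons described here, the workhorse is McNemar's test on discordant decisions. The study used Obuchowski's clustered extension; the plain version below, via statsmodels, ignores reader-pair clustering and uses invented counts, so it is a sketch of the basic test only:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired recall decisions on positive cases:
# rows = human arm (recall / no recall), columns = AI arm (recall / no recall).
table = np.array([[120, 18],   # both arms recall / human arm only
                  [35, 102]])  # AI arm only / both arms miss
result = mcnemar(table, exact=False, correction=True)
print(f"statistic={result.statistic:.2f}, p={result.pvalue:.4f}")
```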
The invasive lesion size (small, <15 mm; large, ≥15 mm) was taken from the NBSS. The subgroups used were as in NHSBSP screening statistics. As the study was not powered for subgroup analysis and there were no prespecified subgroup endpoints, these subgroup analyses should be considered exploratory and hypothesis generating. We therefore present unadjusted CIs for subgroup differences to describe observed trends and magnitudes of effect within subgroups. These CIs should be interpreted cautiously because of the lack of power and the increased risk of false-positive findings associated with multiple subgroup comparisons. No formal hypothesis testing or multiplicity adjustments were conducted for these exploratory subgroup analyses. Case-level CIs were calculated using Wald CIs for groups of >50 cases; for groups of <50 cases, bootstrapping was used.

Finally, localization analysis of the bounding boxes drawn during arbitration was performed using the RJafroc package v2.1.2 in RStudio v4.3.3. A correctly localized lesion was defined as the overlap between the ROI drawn at arbitration and the corresponding ground-truth ROI having an intersection over union (IoU) value ≥0.1 (a worked sketch of this criterion follows at the end of this excerpt). All IoU values <0.3 were reviewed by a radiologist who did not participate in the reader study, and the hit-or-miss decision was changed accordingly.

For the human factors analysis, perceived task load differences between the human and AI arms were analyzed using Wilcoxon's signed-rank test. Other questions, such as those on trust and general impressions, were examined using descriptive statistics for closed-ended questions, and open-ended responses underwent dual-coder thematic analysis. For all positive cases where the AI correctly recalled but human arbitration then overrode the decision, we checked whether the AI ROI had correctly localized the ground-truth ROI. In addition, the average number of false-positive prompts per case was calculated for all cases, positive cases, negative cases and positive cases where the AI correctly recalled but human arbitration then overrode the decision. 2 × 2 tables of outcomes for the human and AI arms are provided for all positive cases (Supplementary Table 1), screen-detected cancers only (Supplementary Table 2) and negative cases only (Supplementary Table 3).

We powered the study by simulating a two-arm, within-case design (routine versus AI assisted), where each case is read under both regimens and the primary analysis is a matched-pair Wald test for noninferiority on sensitivity (specificity was expected to be amply powered, given the low prevalence). We assumed identical underlying performance in both arms: latent continuous scores with an area under the curve of 0.90, binarized at a common threshold to yield 73% sensitivity and specificity using 39-month outcomes. Between-arm correlation was modeled via an agreement parameter set to 84.5%, matching previously observed R1-R2 concordance on positives. We modeled the two site-specific arbitration protocols (R1 | R2 and R1 ≠ R2) and powered the study using a worst-case scenario that combined the R1 | R2 arbitration style, consensus panel recall of 0.73 and agreement between arms of 0.70. Under these assumptions, 275 cancer-positive cases exceeded 90% power, whereas 200 positives provided 80% power. We therefore targeted a minimum of 200 positive cases per site to achieve 80% power.
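Returning to the localization criterion above: a hit required an intersection over union of at least 0.1 between the arbitration ROI and the ground-truth ROI. A minimal sketch of that check, with boxes as (x, y, width, height) tuples and hypothetical coordinates (the RJafroc internals are not reproduced here):

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

reader_roi = (100, 120, 60, 50)     # hypothetical arbitration ROI
ground_truth = (110, 130, 70, 55)   # hypothetical ground-truth ROI
hit = iou(reader_roi, ground_truth) >= 0.1   # the study's hit criterion
print(round(iou(reader_roi, ground_truth), 3), hit)
```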
Assuming the population prevalence of cancer, this corresponded to 25,000 cases per site. Randomization is not applicable to this study because it was retrospective and all clients were in both arms of the study. As described above, it was not possible to blind the readers to the AI arm because the AI output was overlaid on the images and the human readers' decisions appeared on the paperwork; however, this is clinically realistic because it is how the images would be read in practice. The data met the requirements of the paired binary tests used (Wald's and McNemar's tests). The data exclusions were defined before the study. Of the 50,000 women, 4,354 (8.7%) were excluded for meeting the AI exclusion criteria (technical recalls, cases containing more or fewer than four images, and implants) and 44 (0.1%) cases were excluded owing to insufficient or conflicting clinical information. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
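The power simulation described above can be approximated in a few lines. The sketch below is deliberately simplified: it draws correlated paired decisions with equal sensitivity in both arms and runs a matched-pair Wald noninferiority test, but it omits the arbitration modeling, so it will not reproduce the paper's exact power figures:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(42)

def power(n_pos: int, sens: float = 0.73, agree: float = 0.845,
          margin: float = 0.05, alpha: float = 0.025,
          n_sim: int = 20_000) -> float:
    """Monte Carlo power of a matched-pair Wald noninferiority test on
    sensitivity (simplified: no arbitration, equal marginals in both arms)."""
    b = c = (1 - agree) / 2        # probabilities of the two discordant outcomes
    p11 = sens - b                 # both arms recall the cancer
    p00 = 1 - p11 - b - c          # both arms miss it
    counts = rng.multinomial(n_pos, [p11, b, c, p00], size=n_sim)
    n10, n01 = counts[:, 1], counts[:, 2]
    diff = (n10 - n01) / n_pos
    se = np.sqrt((n10 + n01) - (n10 - n01) ** 2 / n_pos) / n_pos
    z = (diff + margin) / np.maximum(se, 1e-12)
    return float(np.mean(z > NormalDist().inv_cdf(1 - alpha)))

print(power(200), power(275))   # power grows with the number of positive cases
```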
[4]
AI model flags hidden breast cancers years before diagnosis in routine mammograms
By Hugo Francisco de Souza. Reviewed by Susha Cheriyedath, M.Sc. March 9, 2026.

A large NHS screening study shows that artificial intelligence can detect subtle signals in "normal" mammograms that reveal which women are most likely to develop aggressive interval cancers years before they appear. Study: Performance of breast cancer risk prediction algorithms across mammography systems in the UK screening programme.

In a recent study published in the journal npj Digital Medicine, researchers conducted a large-scale (n = 112,621) retrospective validation study to evaluate the performance of four state-of-the-art deep learning (DL) algorithms for predicting "interval cancers": cancers diagnosed after a negative screening mammogram but before the next scheduled screening examination. These account for approximately 30% of cancers in screening programs and represent a critical diagnostic gap in current mammogram-based screening approaches. The study's findings revealed the academic DL model Mirai (developed by MIT) as the best-performing model (interval cancer AUC = 0.77). The model identified about 27.5% of interval cancers in the study cohort by flagging the top 4% of "normal" (negative) screening mammograms as highest risk. While the study noted that model performance varied slightly across the specific machines used to produce the mammograms, and that one algorithm showed statistically significant differences between systems, these findings suggest that DL tools could potentially support risk-stratified breast cancer screening strategies, although prospective clinical evaluation would be required before implementation.

Background: The Challenge of Interval Breast Cancers

For decades, breast cancer screening recommendations have involved women receiving a mammogram once every few years (e.g., every 3 years in the United Kingdom [UK]). However, a growing body of evidence suggests that while these periodic screenings are necessary and effective at detecting most breast cancers, they fail to identify "interval cancers", which are diagnosed after a negative screening mammogram but before the next scheduled screening. These "hidden" cancers, which develop or become clinically apparent between screening rounds, are often significantly more aggressive than those detected in routine mammograms, leading to worse prognosis and clinical outcomes, including death. Traditional approaches to addressing interval cancers have involved clinicians attempting to predict individual risk via genetic assessments (such as polygenic risk scores, which are not routinely implemented in most population screening programs) and family history evaluations (often incomplete). However, recent advances in deep learning have led researchers to hypothesize that these artificial intelligence (AI) models, trained on millions of mammogram images, may be able to recognize subtle imaging patterns and tissue characteristics in breast tissue that human radiologists might overlook. Unfortunately, given the wealth of commercial and academic DL models currently available, clinicians do not yet know which model to choose or whether these tools can perform well enough to be included in personalized care.
Study Objective and Model Comparison

The present study aimed to address this knowledge gap by conducting a head-to-head comparison of the breast cancer predictive performance of four of today's most advanced DL models: Mirai (MIT), iCAD ProFound AI Risk (a commercially available model), Transpara Risk (another commercially available DL tool) and Google Health's risk model.

Validation Dataset from the UK NHS Screening Program

These models were provided with an extensive retrospective validation dataset from the UK's National Health Service (NHS). The dataset comprised high-resolution "negative" (cancer-free) screening mammograms (n = 112,621) collected between 2014 and 2017 from two distinct NHS screening sites. Model performance was validated by tracking participants for five years to observe which women eventually developed breast cancer (approximately 1,225 cancers across the follow-up period), including interval cancers.

Evaluation Across Mammography Hardware Platforms

To evaluate the generalizability of algorithm performance across different mammography hardware platforms, the DL models were evaluated on mammography images from different hardware ecosystems, specifically machines from Philips and GE.

Predictive Performance of the Deep Learning Models

The study findings revealed that the academic algorithm Mirai consistently demonstrated the highest predictive power (area under the curve [AUC] = 0.72; p < 0.001). While iCAD (AUC = 0.70), Google (AUC = 0.68) and Transpara (AUC = 0.65) achieved lower scores, their predictive performance was still notable given that the input mammograms had previously been interpreted as "normal" during routine screening.

Identification of High-Risk Patients for Interval Cancers

Study observations indicated that these models could identify future interval cancers from screening examinations initially interpreted as negative (Mirai's interval cancer AUC = 0.77). When researchers examined the top 4% of women identified by Mirai as "highest risk", about 27.5% of all interval cancers in the cohort occurred within this high-risk group during follow-up. Expanding the high-risk group to the top 14% of women roughly doubled the yield, capturing approximately 50.3% of all future interval cancers in the cohort.

Performance Across Mammography Machine Manufacturers

The study also evaluated whether algorithm performance differed across mammography machine manufacturers. Researchers found that three of the four evaluated models performed statistically similarly on images generated by Philips and GE machines. While the Transpara model performed better on images generated by GE machines than on those generated by Philips machines, the difference was relatively modest (AUC = 0.69 versus 0.62). The researchers also highlight several limitations, including the exclusion of mammograms with implants or non-standard imaging views, incomplete ethnicity data and the possibility that results may not fully generalize to mammography systems from other major vendors. The authors also note that retrospective validation may underestimate the potential clinical utility, since some cancers might be detected through additional imaging pathways rather than solely through symptomatic presentation.

Conclusions: Toward Risk-Stratified Breast Cancer Screening

The present study provides evidence suggesting that DL models can identify previously unrecognized imaging signals in standard mammograms to predict future cancer risk.
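The two headline numbers reported above, an AUC and a "top 4% captures 27.5%" yield, come from the same underlying risk scores. Here is a self-contained sketch on synthetic data (sklearn for the AUC; the labels and score distributions are invented, so the printed numbers will not match the study's):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 112_621
is_interval_cancer = rng.random(n) < 0.003          # synthetic labels
risk = np.where(is_interval_cancer,
                rng.normal(0.8, 1.0, n),            # synthetic scores: cancers score higher
                rng.normal(0.0, 1.0, n))

print(f"AUC: {roc_auc_score(is_interval_cancer, risk):.2f}")

top_fraction = 0.04                                 # flag the top 4% as highest risk
cutoff = np.quantile(risk, 1 - top_fraction)
flagged = risk >= cutoff
captured = (flagged & is_interval_cancer).sum() / is_interval_cancer.sum()
print(f"Top {top_fraction:.0%} of scores capture {captured:.1%} of interval cancers")
```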
Models such as MIT's Mirai were shown to identify and flag a significant proportion of interval cancers within a small group of high-risk women. Future work should aim to investigate these results in prospective clinical trials and real-world screening settings before such tools can be integrated into personalized screening protocols.

Journal reference: Rothwell, J., et al. (2026). Performance of breast cancer risk prediction algorithms across mammography systems in the UK screening programme. npj Digital Medicine. DOI: 10.1038/s41746-026-02507-7. https://www.nature.com/articles/s41746-026-02507-7
[5]
Breast cancer detection 'up by 10% with use of AI' - study
Breast cancer detection can be improved by more than 10% with the use of an AI tool, according to the results of a new study.

The evaluation was led by the University of Aberdeen following an NHS Grampian project. The team assessed how the AI software could be used to support healthcare workers in the routine breast screening of more than 10,000 women, who could also then be notified of the results more quickly.

Yvonne Cook, from Aberdeen, who is in her 60s, had opted in to the AI research, and breast cancer was detected and then treated. "I just feel incredibly lucky," she said.

The study's findings will now be expanded as part of a further trial looking at the use of AI in breast screening at sites throughout the UK. The AI tool, called Mia, has been developed by medical technology firm Kheiron. It can flag possible small and hard-to-spot areas of concern on mammogram scans that can be missed by the human eye.

The breast cancer screening study, published in the Nature Cancer journal on Tuesday, found it could increase detection by 10.4%. It also found it could reduce staff workload and cut the time to notify the women affected. The research team described the findings as "hugely significant", as earlier detection enables earlier treatment and, in turn, a greater likelihood of treatment success.

Yvonne went to what she thought would be a routine mammogram appointment in 2023. In the waiting room she noticed a sign explaining that a project was under way involving AI to assist in reviewing mammograms, and that participation was optional. "It didn't occur to me for a minute to opt out," she said. "I think it said that AI would be utilised as part of the research project to review the mammogram and I just thought, why not?"

A short time later, she received a recall letter requesting additional imaging. "I guess they don't want to alarm people unnecessarily; the letter said they wanted to do a follow-up mammogram, which might be as a result of the initial result not being particularly clear. When I arrived for that appointment, they said that it was the AI part of the analysis that had picked up something. I had a scan and the consultant confirmed that the AI diagnosis was correct, that there was a small, Grade 2 tumour there, too small to be detected by the human eye."

She added: "Overwhelmingly, I just felt incredibly lucky that I was part of the research programme and that it had been picked up at this early stage."

Yvonne was immediately put on medication to inhibit the growth of the tumour, followed by surgery. "Had the AI not picked up the small tumour when it did, then either it would have been discovered at my next routine mammogram three years later, or I would have picked it up when it had grown to a stage that I was able to feel it," she said. "If that had been the scenario, then it's likely that the surgery would have been more invasive. The cancer could have spread, it could have involved chemotherapy and a much longer recovery time with more impact on my life."

Prof Gerald Lip, clinical director for breast screening in the north east of Scotland, said the results showed that AI could "effectively support" services by increasing cancer detection and reducing workload. "The bottom line here is without AI, doctors would not have caught these cancers as early," he said. "The translation of AI into clinical practice is one of the operational challenges in the coming decade. Our findings will inform the conversation around using AI in healthcare."
[6]
How AI can improve breast cancer detection in the UK
Breast cancer affects one in every eight women in the UK. In this fight, early detection is crucial to giving people the best chance of overcoming the disease. New research from Google, Imperial College London and the UK's National Health Service (NHS), published as a pair of studies in Nature Cancer today, marks a turning point in screening technology and reveals how AI can strengthen early detection efforts. Our experimental research AI system identified 25% of the "interval cancers" that were previously missed: the cases that typically slip through traditional screenings and only surface after symptoms appear, when they become more challenging to treat. But this research goes beyond the accuracy of the scans. It offers a first-of-its-kind, large-scale look at how radiologists react when AI challenges or confirms their diagnosis in a clinical setting.

In the UK's NHS, the frontline of breast cancer screening relies on a rigorous "double-reading" process: two specialists must agree on every mammogram, with an arbitration panel deciding any disputes. It is a vital safety net, but one that's stretched to its limit. Each specialist must review roughly 5,000 scans annually, with just four hours of dedicated time per week, all amid a global shortage of radiologists. We set out to determine how AI could help to tackle this challenge. The first step was comparing the accuracy of AI-based mammography interpretation with that of expert radiologists. We tested this by using AI to review the mammograms of 125,000 women, and the results were definitive: the AI-based screening detected 25% of the total interval cancers (cancers detected between scans) previously missed. AI also identified more invasive cancers and more cancers overall than the expert radiologists, and identified fewer false positives for women having their first-time scan.
Major UK studies show AI integration into breast cancer screening increases detection rates by 10.4% and identifies up to 27.5% of interval cancers before they become visible. The technology also reduces radiologist workload by up to 44% while maintaining accuracy across diverse populations, marking a shift in how screening programs could operate.

AI has demonstrated the ability to detect breast cancer more effectively than human readers in multiple large-scale UK studies involving over 125,000 women. The technology achieved a cancer detection rate of 9.33 per 1,000 women compared to 7.54 per 1,000 for first human readers, representing a 10.4% increase in cancer detection [1]. This superior performance was maintained across five NHS (National Health Service) breast screening services, despite varying clinical workflows and patient populations. The retrospective evaluation covered women aged 50-70 who were screened between 2015 and 2016, with the AI system achieving superior sensitivity and noninferior specificity compared to first reader, second reader, and consensus decisions after arbitration [1]. The diagnostic accuracy remained consistent across different mammography equipment from Hologic, GE, and Siemens, demonstrating the technology's adaptability to existing infrastructure.

One of the most significant findings involves the potential to reduce workload for radiologists while maintaining or improving cancer detection. The GEMINI study showed that AI integration into breast cancer screening could reduce workload by up to 44% when used in a triage workflow, where AI serves as a second reader for a subset of cases [2]. A primary AI workflow demonstrated workload savings of up to 31% without increasing the number of women recalled for further investigation. In Denmark, where AI is routinely used in one region, workload has been reduced by 33.5% and recall rates by 20.5% [2]. The technology enables single human reading for cases where AI can confidently serve as the second reader, addressing critical staffing challenges facing screening programs. Different implementation strategies allow sites to prioritize based on their specific needs, whether detecting more cancers, reducing recalls, or saving on workload.

Perhaps most striking is AI's ability to identify interval cancers: those diagnosed after a negative screening mammogram but before the next scheduled screening. These cancers account for approximately 30% of cancers in screening programs and are often more aggressive, with worse prognosis [4]. The academic model Mirai, developed by MIT, achieved the best performance with an interval cancer AUC of 0.77, identifying about 27.5% of interval cancers by flagging the top 4% of "normal" mammogram scans as highest risk [4]. The AI system correctly identified 25.0% of future interval cancer cases, with 88.0% localized to the correct breast and 58.1% to the precise lesion [1]. For cancers only identified at the subsequent screening visit three years later, AI correctly flagged 25.1% of cases.

The practical implications became clear through individual patient experiences. Yvonne Cook, a woman in her 60s from Aberdeen, had breast cancer detected through the AI tool Mia during what she thought would be a routine mammogram in 2023 [5]. The AI flagged a small, Grade 2 tumor that was too small to be detected by the human eye. "Had the AI not picked up the small tumour when it did, then either it would have been discovered at my next routine mammogram three years later, or I would have picked it up when it had grown to a stage that I was able to feel it," she explained [5]. The University of Aberdeen study involving more than 10,000 women found that AI could also reduce the time to notify women of results [5]. Prof Gerald Lip, clinical director for breast screening in northeast Scotland, stated: "The bottom line here is without AI, doctors would not have caught these cancers as early" [5].

The AI demonstrated consistent performance across multiple demographic subgroups, including age, index of multiple deprivation, ethnicity, and breast density, with no notable differences compared to first human readers [1]. Sensitivity and specificity remained within acceptable margins across most groups, with AI particularly excelling for women over 65 years of age. The technology showed a preference for detecting higher-risk cancers. Compared to the first reader, AI achieved higher sensitivity for higher-risk cancers (0.55 versus 0.44) and noninferior sensitivity for lower-risk cancers [1]. For invasive cancers alone, the AI system achieved superior sensitivity compared to first, second, and consensus decisions (0.54 versus 0.43, 0.46, and 0.46, respectively) [1].

The AIMS study, which included 50,000 women from two NHS breast screening centers, explored AI as a second reader including the arbitration process [3]. The study showed that AI-enabled reading was noninferior to a standard two-reader workflow, with overall reading workload reduced when AI replaced a second reader [3]. Multiple workflow options exist for healthcare settings. A "triage negatives" workflow would reduce workload by 37% and recall rates by 12.2% while identifying all cancers detected by routine screening, with positive predictive value increasing by 13.8% [2]. These flexible implementation strategies allow screening programs to align AI deployment with their operational goals and capacity.

A head-to-head comparison of four advanced models (Mirai from MIT, iCAD ProFound AI Risk, Transpara Risk, and Google Health's risk model) revealed varying performance levels across a dataset of 112,621 mammogram scans [4]. While Mirai achieved the highest AUC of 0.72, all models demonstrated notable predictive performance on mammograms previously interpreted as "normal" during routine screening.

The findings will inform expanded trials examining AI use in breast screening at sites throughout the UK [5]. However, limitations remain, including the need for three-year follow-up data to assess actual sensitivity, the exclusion of 8.7-10.8% of cases that fell outside AI's intended use, and questions about how radiologist behavior might change with AI support in live clinical settings [2][3]. Further development in explainability and acceptance by mammography readers will be essential for widespread adoption.