Regarding malignancy diagnosis using Top-3 predictions, the malignancy prediction rate was 12.0% (95% CI 11.8-12.1%) in Korea and 10.0% (95% CI 9.9-10.0%) globally. (Supplementary Table 4) Assuming that all malignancy diagnoses by the algorithm were false positives, the estimated specificity using Top-3 predictions was 88.0% (95% CI 87.9-88.2%) in Korea and 90.0% (95% CI 90.0-90.1%) globally. By continent, the Top-3 malignancy prediction rates were as follows: North America (11.7%), Asia (10.9%), Europe (9.3%), Oceania (9.0%), South America (8.0%), and Africa (6.7%).
Here, we present a worldwide study evaluating the real-world clinical use of an open-access global dermatology AI service, based on 1.69 million assessments from 228 different countries. This large number of requests reflects strong user interest in dermatology AI applications and supports their utility in daily clinical practice. However, evaluating the real-world performance of AI in dermatology is highly challenging because it requires a continuous dataset linking AI results to biopsy outcomes. Given the lack of population-based reference skin cancer datasets and the limited digitalization of clinical practice, obtaining large datasets with serial results is virtually impossible. Furthermore, evaluating specificity is particularly problematic, as curated hospital datasets fail to capture the wide range of out-of-distribution conditions encountered in real-world settings.
A substantial discrepancy exists between diagnostic performance based on images alone and diagnosis in actual clinical practice. For the diagnosis of 43 tumor types, attending physicians achieved Top-1 and Top-3 accuracies of 68.1% and 77.3%, respectively, whereas physicians in the reader test achieved only 37.7% and 53.4% for the same cases. Although AI has demonstrated exceptional performance in controlled settings, it must prove its effectiveness in real-world environments to be meaningful in clinical practice. Despite the large number of retrospective studies, real-world evidence remains insufficient. As of 2024, only 86 medical AI algorithms have reported RCT-level evidence; among these, 70 achieved successful outcomes, of which only one was in the field of dermatology.
We curated a large dataset representing cancer cases in Korea to estimate the sensitivity for cancer diagnosis. To assess specificity, we analyzed usage statistics under the assumption that all AI-determined malignancy predictions were false positives. This separate estimation approach enabled us to bound the algorithm's real-world sensitivity and specificity.
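Schematically, and using notation introduced here purely for illustration, the two estimates bound the real-world operating point:

$$\widehat{Se}_{\max} = \left.\frac{TP}{TP+FN}\right|_{\text{NIA dataset}}, \qquad \widehat{Sp}_{\min} = 1 - \frac{n_{\text{malignant predictions}}}{n_{\text{all assessments}}},$$

where the specificity bound holds when cancer prevalence among users is negligible, so that nearly every assessment is either a true negative or a false positive.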
First, regarding sensitivity, the maximum sensitivity was estimated under the assumption that false negatives are minimized when using a national-scale hospital dataset. The NIA dataset is large enough to represent skin cancer cases in Korea. Therefore, the sensitivity of 78.2% (95% CI, 77.0-79.4%), predicted by three differentials, was presented as the ideal maximum value achievable in real-world settings. If users capture only low-quality images, the actual sensitivity could be lower; conversely, repeated testing of a lesion over several days may increase sensitivity in practice.
Next, regarding specificity, the minimum specificity was estimated under the assumption that there are no true positives among the malignancy predictions in the usage data (i.e., all malignancy predictions are false positives). The analysis of usage records in Korea showed that the algorithm predicted malignancy at a rate of 12.0% (95% CI, 11.8-12.1%) based on three differentials. Given the relatively low prevalence of cancer, under this assumption the specificity was estimated at 88.0% (95% CI, 87.9-88.2%). (Supplementary Table 4) This figure (88.0%) is lower than the specificity reported for the NIA dataset (93.0%; Table 1). At first glance, benign tumors requiring hospital visits might seem harder for the AI to diagnose than the various skin conditions seen in daily life. However, real-world use includes out-of-distribution scenarios, which likely contribute to the degradation of AI performance. On the other hand, the actual specificity may be slightly higher than our 88.0% estimate, as users often confirm malignant results through repeated tests before visiting a hospital.
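The narrow confidence intervals follow from the very large usage denominator. As a minimal sketch, assuming a hypothetical assessment count n (the exact Korean denominator is not restated here) and the reported 12.0% malignancy prediction rate, a Wilson score interval can be computed as follows:

```python
from math import sqrt

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

n = 1_000_000          # hypothetical number of Korean assessments (illustrative only)
k = int(0.120 * n)     # malignant Top-3 predictions at the reported 12.0% rate
lo, hi = wilson_ci(k, n)
print(f"malignancy prediction rate: {lo:.3%} - {hi:.3%}")
print(f"specificity lower bound:    {1 - hi:.3%} - {1 - lo:.3%}")
```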
From the perspective of disease screening, WHO tuberculosis guidelines set minimal requirements of 90% sensitivity and 70% specificity. Similarly, the Breast Cancer Surveillance Consortium benchmarks cite 86.9% sensitivity and 88.9% specificity for breast cancer screening. For skin cancer screening, however, no precise sensitivity and specificity guidelines currently exist. Furthermore, concerns about overdiagnosis in dermatologist-led skin cancer screening highlight the need to discuss the appropriate use of algorithms whose real-world performance is lower. As seen in Fig. 2a, a higher malignancy rate on the map may be related to greater algorithm usage for skin cancer, similar to how an increase in dermatology clinics is associated with a higher regional incidence of melanoma. However, unlike human physicians, algorithms can be adjusted to operate with either very high or low specificity by modifying the decision threshold. For example, in scenarios where overdiagnosis is a concern, the threshold could be raised to identify only cancers with the highest diagnostic certainty, particularly in populations with limited access to healthcare; a simple sketch of this trade-off follows. From this perspective, efforts should be made to find the proper operating point for each region and to demonstrate improvements in clinical outcomes (mortality, morbidity, and costs).
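The following Python sketch illustrates the threshold effect on hypothetical, randomly generated probability outputs; it is not the deployed system's code, and the function and variable names are ours:

```python
import numpy as np

def screen(p_malignant: np.ndarray, threshold: float) -> np.ndarray:
    """Flag lesions whose predicted malignancy probability meets the threshold."""
    return p_malignant >= threshold

rng = np.random.default_rng(0)
scores = rng.beta(2, 8, size=10_000)   # hypothetical model outputs, skewed toward benign

for t in (0.2, 0.5, 0.8):              # raising t trades sensitivity for specificity
    flagged = screen(scores, t).mean()
    print(f"threshold={t:.1f}: {flagged:.1%} of cases referred as possibly malignant")
```

Raising the threshold lowers the referral rate (fewer false positives, hence higher specificity) at the cost of missing some true cancers.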
The usage statistics revealed significant regional differences in the types of diseases predicted by the algorithm. (Figs. 2, 3, Supplementary Table 3) Neoplastic disorders were more commonly predicted in Asia, Europe, and North America. (Fig. 2) Premalignant conditions such as actinic keratosis were observed at higher rates in Australia and North America. (Fig. 3) Interestingly, in line with these findings, users from the EU, North America, and Oceania showed higher sensitivity in the global reader test than those from other regions. (Fig. 1b) In contrast, infectious diseases were more frequently observed in regions such as Northern Africa and the Middle East. (Fig. 2c) These regional differences reflect variations in disease prevalence, the age demographics of users, and the diseases users are particularly concerned about in each region. For example, in South Korea, despite the low prevalence of skin cancer, strong media interest in skin cancer may lead to the higher malignancy ratio shown in Fig. 2a. These data, collected before patients even visit a clinic, can offer more accurate insights into the prevalence of, and interest in, specific skin conditions. The Global Burden of Disease study has attempted this, but only for certain disease groups, with no individualized approach to dermatological diseases. This indicates that AI-based big data analytics could significantly contribute to understanding skin disease trends globally.
Our study has several limitations. First, the initial part of the study was limited to patients with skin types III and IV, which account for the vast majority of skin types in Korea. Therefore, we were unable to provide data stratified by race and skin type, especially for populations with dark skin tones. Additional studies are required to evaluate the sensitivity for skin cancer in white and black populations.
Second, the NIA dataset does not include less common skin cancers beyond the four major types (e.g., Merkel cell carcinoma and Kaposi sarcoma). According to 20 years of incidence statistics in Korea, these other types account for approximately 11.3% of all skin cancers. Early detection with AI can be particularly beneficial for rare skin cancers with a very poor prognosis, such as cutaneous angiosarcoma. Therefore, rare cancers not represented in the NIA dataset should be collected and analyzed with appropriate sample size considerations to ensure sufficient statistical power and validity.
Third, the sensitivity and specificity calculated in this study need to be re-evaluated through further digital transformation efforts, separately for each indication. The global part of the study lacked a gold standard with which to calculate sensitivity and specificity. Ideally, users would provide feedback on the final diagnosis based on clinical follow-up, laboratory work-up, or final histopathological assessment. Further validation should be performed individually in each country, tailored to its healthcare environment.
Fourth, in terms of multi-class results, the algorithm achieved Top-1 and Top-3 accuracies of 43.3% and 66.6%, respectively, on the NIA dataset. Although the multi-class performance of the algorithm was validated using this large dataset, its real-world accuracy also needs to be reassessed in further studies.
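For clarity, Top-k accuracy counts a case as correct when the true diagnosis appears among the algorithm's k highest-ranked differentials; a minimal illustration with toy values (not our data) follows:

```python
import numpy as np

def top_k_accuracy(probs: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of cases whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(probs, axis=1)[:, -k:]   # indices of the k largest scores per case
    return float(np.mean([labels[i] in top_k[i] for i in range(len(labels))]))

# Toy example with five classes; a real evaluation would use the model's outputs.
probs = np.array([[0.1, 0.5, 0.2, 0.1, 0.1],
                  [0.3, 0.1, 0.4, 0.1, 0.1]])
labels = np.array([2, 0])
print(top_k_accuracy(probs, labels, 1))   # 0.0: neither true class ranks first
print(top_k_accuracy(probs, labels, 3))   # 1.0: both true classes rank in the top 3
```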
Fifth, because this study evaluates the real-world use of a single algorithm by global users, it was challenging to determine the best analytical approach. Since usage varies over time, results are displayed as the disease ratio within each country.
Sixth, although this was a global study, data from Africa, South America, and Oceania were relatively underrepresented, which may reflect regional differences in population size, access to digital healthcare, and overall engagement with digital health initiatives.
Finally, while the algorithm was designed to prioritize high specificity, it may be perceived as insufficient from the perspective of maximizing sensitivity. Although high sensitivity has traditionally been favored to avoid missing malignant cases, this approach often results in a high false-positive rate, as seen in the case of MelaFind. In the context of AI tools that may be used frequently by laypersons without clinical oversight, insufficient specificity could amplify overdiagnosis and unnecessary anxiety. Our specificity-focused design therefore addresses concerns about overdiagnosis and false alarms, particularly in low-risk populations; however, it limits the algorithm's ability to capture all true positives.
In conclusion, using a national-scale curated dataset and real-world usage data, the performance of AI in diagnosing skin cancer in Korea was estimated at 78.2% sensitivity (95% CI, 77.0-79.4%) and 88.0% specificity (95% CI, 87.9-88.2%). In multi-class classification, our algorithm achieved Top-1 and Top-3 accuracies of 43.3% and 66.6%, respectively, replicating the results obtained in a previous smaller study. Furthermore, this study highlights the potential of AI algorithms to provide a global perspective on skin diseases, offering a quantitative reflection of regional variations through AI-based big data analytics. Further research is needed to identify clinical settings in which AI can effectively improve clinical outcomes. Additionally, randomized controlled trials (RCTs) and validation in underrepresented regions, such as Africa and South America, are needed to ensure the algorithm's effectiveness and generalizability across diverse populations.