Publications investigating the application domain "medical knowledge" with LLMs in a prompt-and-answer setting were included in the meta-analysis. Publications were eligible if they reported the total number of questions and an endpoint describing the proportion of correct answers as a percentage (usually reported as accuracy or sensitivity). The I² statistic was calculated under a random-effects model using the R package metafor (ref. 13). Because of the high heterogeneity, risk of bias and quality of evidence were not assessed. All statistical analyses were performed in RStudio (version 2024.04.0+735). No formal hypothesis testing was conducted; the presented meta-analysis is therefore descriptive in nature.
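The analysis itself was run with metafor; as a minimal sketch of the underlying arithmetic, the Python function below estimates I² for a set of study accuracies under a DerSimonian-Laird random-effects model. The logit transformation and the example data are assumptions for illustration, not settings taken from the review.

```python
import numpy as np

def i_squared(correct, total):
    """Estimate I^2 (%) for a set of study accuracies (DerSimonian-Laird sketch).

    correct, total: numbers of correctly answered and total questions per study.
    Proportions are logit-transformed here for illustration; the review itself
    used the R package metafor.
    """
    correct = np.asarray(correct, dtype=float)
    total = np.asarray(total, dtype=float)

    p = (correct + 0.5) / (total + 1.0)                       # continuity-corrected proportion
    yi = np.log(p / (1 - p))                                  # logit effect size per study
    vi = 1 / (correct + 0.5) + 1 / (total - correct + 0.5)    # approximate sampling variance

    wi = 1 / vi                                               # inverse-variance weights
    theta = np.sum(wi * yi) / np.sum(wi)                      # pooled (fixed-effect) estimate
    Q = np.sum(wi * (yi - theta) ** 2)                        # Cochran's Q
    df = len(yi) - 1

    return max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0

# Hypothetical example: five studies with differing accuracies and sample sizes
print(i_squared(correct=[40, 60, 20, 70, 30], total=[50, 80, 60, 90, 40]))
```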
The literature search initially yielded 483 publications matching the search string. During title and abstract screening, 373 publications were excluded. The remaining 110 publications were screened in full text, leading to the exclusion of another 76. The systematic review therefore includes 34 eligible studies, all published between January 2021 and March 2024. The study selection is presented as a PRISMA flow diagram (see Supplementary Fig. 1), and a comprehensive overview of the included publications is provided in Tables 2 and 3.
Most studies prompted medical questions and analysed the LLM output (medical knowledge, 32/34), while a smaller portion explored LLMs for patient compliance (patient empowerment, 1/34) or for translating or summarizing information for patients (translation/summary, 2/34). The majority of the included studies focused on gynaecological cancer (8/34), prostate cancer (8/34), oropharyngeal cancer (6/34), or lung cancer (5/34), with some publications covering various cancer entities (4/34). The included publications evaluated different tasks, the two most prevalent being prompting an LLM for diagnostic recommendations (14/34) and for treatment recommendations (30/34).
This section elucidates the methodology and results pertaining to the use of LLMs in the domain of medical knowledge. The developed evaluation framework encompasses the assessment of sources for generating inquiries, the language model utilized, the questioning procedure followed, and the methods for evaluating outputs. The results are illustrated in Fig. 1.
Various sources were used to generate the input, i.e. the questions for prompting. These included official information forums (3/34), guidelines (7/34), frequently asked questions (FAQs) from various sources such as social media, hospital websites and Google Trends (8/34), clinical cases (8/34), and multiple-choice questions from medical exam banks (2/34); some authors curated questions without reporting a specific source (4/34). English was predominantly used for prompting. A median of 51 questions (range 8 to 293) were prompted per study. Twenty-one of 34 publications (61.8%) included the prompts and LLM outputs either as supplementary material or within the main text.
Every included study tested GPT-3.5, GPT-4, or both: 27 of 34 studies evaluated the performance of GPT-3.5, 11 of 34 assessed GPT-4, and 10 of 34 compared multiple LLMs. LLMs were used in their standard publicly accessible versions, either via the web application or through an application programming interface (API).
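As a minimal sketch of the API route, the snippet below queries a chat model through the OpenAI Python SDK; the model name, system prompt, temperature setting, and question are illustrative placeholders rather than settings taken from any of the included studies.

```python
# Requires the `openai` package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

question = "What are the guideline-recommended first-line treatments for stage IV NSCLC?"  # illustrative

response = client.chat.completions.create(
    model="gpt-3.5-turbo",          # the model under evaluation
    temperature=0,                  # reduce run-to-run variability
    messages=[
        {"role": "system", "content": "You are assisting with an oncology question-answering study."},
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)
```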
Liang et al. reported fine-tuning GPT-3.5 Turbo, a paid service provided by OpenAI for adapting language models to specific tasks or domains. They employed a dataset of 80 questions on clear cell renal cell carcinoma, designed with binary (true/false) answers. To increase robustness, five distinct variations were added to each question, introducing nuanced adjustments without deviating from the core of the original question. Questions that the LLM answered contrary to the ground truth were then iteratively repeated until satisfactory outcomes were attained. Through this iterative refinement on the foundational set of binary questions, GPT-3.5 Turbo achieved an accuracy of 100% on these specific tasks.
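Liang et al. do not publish their pipeline as code; the sketch below only illustrates, with invented example data, how a binary question and its paraphrased variants could be written to the JSONL chat format expected by OpenAI's GPT-3.5 Turbo fine-tuning service.

```python
import json

# Hypothetical example: one true/false question with paraphrased variants,
# loosely mirroring the "five variations per question" strategy described above.
base_question = "Clear cell renal cell carcinoma is the most common subtype of kidney cancer."
ground_truth = "True"

variants = [
    base_question,
    "True or false: clear cell carcinoma is the most common histological subtype of renal cancer.",
    "Is the following statement correct? Clear cell RCC accounts for the majority of kidney cancers.",
    # ... further nuanced rephrasings that keep the core statement unchanged
]

with open("finetune_train.jsonl", "w", encoding="utf-8") as f:
    for variant in variants:
        record = {
            "messages": [
                {"role": "system", "content": "Answer strictly with 'True' or 'False'."},
                {"role": "user", "content": variant},
                {"role": "assistant", "content": ground_truth},
            ]
        }
        f.write(json.dumps(record) + "\n")

# The resulting file can be uploaded as training data for a fine-tuning job;
# questions the model still answers incorrectly would then feed further rounds,
# as described above.
```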
A dedicated section within the evaluation framework thoroughly examines the reported questioning procedures (Table 1, Items 5-11). This aspect was the most under-reported in the included studies (Fig. 1). Only Holmes et al. provided a comprehensive account of their questioning procedure. Their detailed description included the number of questions, the design of questions (e.g. multiple-choice questions), test-retest cycles, and the use of prompt engineering prior to each cycle. Additionally, they provided source data along with prompt templates and initiated a new chat for each cycle to erase context.
Test-retest reliability is a common psychometric parameter in the evaluation of scoring procedures. This method was applied in 12 of the 34 eligible studies. Authors used test-retest cycles to evaluate the reliability and variability of LLM outputs to the posed questions, typically with a defined time interval between tests.
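The included studies quantify such reliability in different ways; as one possible sketch, the function below computes simple percent agreement and Cohen's kappa between two test-retest cycles of binary-graded answers. The grade labels and example data are hypothetical.

```python
from collections import Counter

def test_retest_agreement(cycle1, cycle2):
    """Percent agreement and Cohen's kappa for two grading cycles.

    cycle1, cycle2: per-question grades (e.g. "correct"/"incorrect") obtained
    from the same question set prompted at two time points.
    """
    assert len(cycle1) == len(cycle2)
    n = len(cycle1)

    observed = sum(a == b for a, b in zip(cycle1, cycle2)) / n

    # Chance agreement expected from the marginal grade frequencies
    c1, c2 = Counter(cycle1), Counter(cycle2)
    expected = sum((c1[g] / n) * (c2[g] / n) for g in set(c1) | set(c2))

    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

# Hypothetical example: six questions graded in two cycles one week apart
print(test_retest_agreement(
    ["correct", "correct", "incorrect", "correct", "incorrect", "correct"],
    ["correct", "incorrect", "incorrect", "correct", "correct", "correct"],
))
```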
Publicly accessible browser-based LLMs (e.g. GPT-3.5, Copilot, Gemini) offer a "new chat" function, which starts a new conversation and erases prior prompts and answers. Only a few studies (9/34) used this function before each prompt. The authors justified this method by noting that posing multiple consecutive questions furnishes an LLM with context within a conversational framework. Continuous questioning can thus introduce context that alters performance on subsequent inquiries, as LLMs are proficient at in-context learning. To mitigate this potential bias, it is advisable to initiate a "new chat" when assessing the "zero-shot" performance of LLMs on medical knowledge.
The counterpart to the "zero-shot" prompting described above is a conversational flow with an LLM. Only one of the 34 studies (Schulte) reported continuous questioning of the LLM, i.e. posing multiple questions within the framework of a single conversation.
Schulte evaluated treatment recommendations for solid tumours and justified this approach as reducing variability and ensuring that all data were collected within a single session. The author had GPT-3.5 tabulate possible therapies and used the National Comprehensive Cancer Network (NCCN) guidelines as the reference. GPT-3.5 listed 77% of the possible therapies in concordance with the guidelines. The approach used by Schulte has previously been described by Gupta et al. as continuous or "fire-side" prompting, which can elicit more detailed LLM outputs because the conversation provides context for subsequent prompts.
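The practical difference between the two questioning modes is whether earlier exchanges remain part of the prompt. The sketch below contrasts them using the OpenAI chat API with placeholder questions; it is an illustration of the general pattern, not code from Schulte or Gupta et al.

```python
from openai import OpenAI

client = OpenAI()
questions = [
    "Which therapies does the NCCN recommend for locally advanced pancreatic cancer?",  # illustrative
    "And for metastatic disease?",
]

def ask(messages):
    reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return reply.choices[0].message.content

# "Zero-shot" / new-chat mode: every question starts from an empty context.
zero_shot_answers = [ask([{"role": "user", "content": q}]) for q in questions]

# Continuous ("fire-side") mode: the conversation history is carried forward,
# so follow-ups such as "And for metastatic disease?" can rely on prior context.
history = []
continuous_answers = []
for q in questions:
    history.append({"role": "user", "content": q})
    answer = ask(history)
    history.append({"role": "assistant", "content": answer})
    continuous_answers.append(answer)
```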
Prompt engineering (PE) was reported in 9 of 34 studies; examples are summarized in Table 2, showcasing the variety of methods described in the literature. PE aims to enhance the outputs of LLMs by refining prompts and has the potential to improve the accuracy and consistency of their performance. Nguyen et al. investigated the impact of providing context through PE on the performance of different prompts and found that adding context increased the performance of open-ended (OE) prompts. Other researchers explored prompting-style strategies that rephrase prompts without altering their original intent (e.g. "What is the treatment for [X]?" vs. "How do you treat [X]?"). Chen et al. evaluated this approach using four different paraphrasing templates, which influenced both the number of treatments aligned with NCCN guidelines and the occurrence of incorrect suggestions.
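To make the paraphrasing strategy concrete, the snippet below applies a few templates to a treatment question; the templates and the condition are examples invented here for illustration, not the four templates used by Chen et al.

```python
# Illustrative paraphrasing templates; Chen et al.'s actual templates differ.
templates = [
    "What is the recommended treatment for {condition}?",
    "How do you treat {condition}?",
    "Which therapies are appropriate for a patient with {condition}?",
    "List the treatment options for {condition}.",
]

condition = "HER2-positive metastatic breast cancer"  # hypothetical example

prompts = [t.format(condition=condition) for t in templates]
for p in prompts:
    print(p)

# Each variant would be sent to the LLM separately and the outputs compared,
# e.g. against guideline-concordant treatments, to measure sensitivity to phrasing.
```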
All publications included in the analysis reported an endpoint to describe the correctness or readability of LLM outputs as measures of performance. In total, 26 different terms used to assess correctness were identified (Fig. 2). The evaluation of LLMs based on the correctness of outputs was termed "grading". Grading methods and the quantity of correct LLM outputs were among the most consistently reported subjects in the literature examined (Table 1, Items 17-19). The grading methods for correctness can be summarized into three groups: binary, one-dimensional, and multidimensional methods (Fig. 2).
The most popular grading method for correctness was the Likert scale, employed by 20 studies; 4-point (n = 7), 5-point (n = 8) and 10-point (n = 3) scales were the most common. The second most frequent approach was binary grading, in which LLM outputs were rated as either correct or incorrect against the source material (n = 8). Additionally, three publications evaluated multiple-choice questions. Only a few studies (n = 5) utilized validated tools to assess LLM outputs. For example, Pan et al. and Musheyev et al. used the DISCERN tool to compare the performance of different LLMs. DISCERN is a validated tool developed to help patients assess the quality of written medical information on treatment choices from internet sources. Lechien et al. created the artificial intelligence performance instrument (AIPI), a multidimensional scoring system for assessing the performance of AI-generated answers to medical questions. It was specifically designed to evaluate generative artificial intelligence (GenAI) systems in clinical scenarios and was tested for reliability and validity. The score comprises subscores assessing patient features, diagnosis, and additional examinations. To mitigate subjectivity, the authors refrained from using Likert scales. The findings demonstrated that the AIPI is a valid and reliable clinical instrument; it was validated in particular on clinical scenarios involving the management of ear, nose, and throat cases.
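As an illustration of how such grading schemes translate into a single accuracy figure, the sketch below aggregates binary grades directly and dichotomizes Likert ratings at an assumed threshold; both the data and the cut-off are hypothetical, and the included studies differ in whether and how they collapse Likert scales.

```python
def accuracy_from_binary(grades):
    """Proportion of outputs graded 'correct' under a binary scheme."""
    return sum(g == "correct" for g in grades) / len(grades)

def accuracy_from_likert(ratings, threshold=4, scale_max=5):
    """Dichotomize Likert ratings: scores >= threshold count as correct.

    The threshold is an assumption for illustration only.
    """
    assert all(1 <= r <= scale_max for r in ratings)
    return sum(r >= threshold for r in ratings) / len(ratings)

# Hypothetical grades for ten LLM answers
print(accuracy_from_binary(["correct"] * 7 + ["incorrect"] * 3))   # 0.7
print(accuracy_from_likert([5, 4, 4, 3, 2, 5, 4, 1, 5, 4]))        # 0.7
```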
The readability and understandability of LLM outputs were assessed as a secondary endpoint by 8 of the 34 studies. The Flesch reading ease (FRE), the Flesch-Kincaid grade level (FKGL), and the patient education materials assessment tool for printable materials (PEMAT-P) were used; these are established instruments for assessing text complexity and estimating understandability from a patient's perspective.
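The two Flesch measures are simple functions of word, sentence, and syllable counts. The sketch below implements the published formulas with a naive syllable heuristic, so its scores will only approximate those of dedicated readability tools; the sample sentence is invented.

```python
import re

def count_syllables(word):
    """Very rough syllable heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syll = sum(count_syllables(w) for w in words)

    # Flesch reading ease (higher = easier to read)
    fre = 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syll / n_words)
    # Flesch-Kincaid grade level (approximate US school grade)
    fkgl = 0.39 * (n_words / sentences) + 11.8 * (n_syll / n_words) - 15.59
    return fre, fkgl

sample = ("Lung cancer screening with low-dose computed tomography is recommended "
          "for adults with a significant smoking history.")
print(readability(sample))
```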
Only a few studies assessed, as a primary endpoint, the ability of LLMs to translate their outputs into language understandable to laypersons. Haver et al. demonstrated that the initial responses of GPT-3.5 to lung cancer questions were challenging to read. To address this, simplified responses were generated with several LLMs, namely GPT-3.5, GPT-4, and Bard. The LLMs succeeded in enhancing readability, with Bard showing the greatest improvement. However, the average readability of the simplified responses still exceeded an eighth-grade level, which is deemed too complex for the average adult patient.
To illustrate the substantial variance in the reported performance of LLMs in medQA, we conducted a formal, explorative meta-analysis. In total, 27 studies were eligible, evaluating either the individual performance of one LLM or a comparative benchmark across multiple LLMs (see Figs. 3 and 4). Studies evaluating a single LLM assessed either GPT-3.5 or GPT-4, with mean accuracies across all studies of 63.6% (SD = 0.23) and 78.0% (SD = 0.16), respectively, and an I² value of 0% (see Fig. 3). In comparative assessments involving multiple LLMs, mean accuracies were 79% (SD = 0.10), 73% (SD = 0.17) and 51% (SD = 0.15) for GPT-4, GPT-3.5, and Bard (LaMDA), respectively, with a calculated I² value of 21% (see Fig. 4). The wide spread of reported accuracies across both sets of studies underscores the variability between studies and makes the pooled results difficult to interpret.