In our study, we create a dataset containing 579 surgical pathology reports in two versions (English and German), with manually obtained high-quality ground truth data for the relevant parameters. We study the extraction capabilities of six different models, both proprietary and open-source (GPT-4, Llama3 70B, Llama3 8B, Llama2 70B, Llama2 13B, and Qwen2.5 7B), and investigate the influence of prompt engineering and weight quantization on model performance. We show that all relevant parameters can be extracted with high accuracy by both proprietary and open-source models in fully structured forms that can be used for analytical/statistical purposes and downstream AI tool development projects, albeit with substantial differences between the models and setups. This information can be highly valuable for the implementation of LLMs in pathology departments. Moreover, we publicly release the English and German versions of our dataset, which can be used as a research benchmark for further LLM tests.
The study analyzed 579 pathology reports from 340 patients with prostate adenocarcinoma who underwent radical prostatectomy between 2020 and 2022. The reports were retrieved in anonymized plain text format from our institute's local database. Ground truth data, used to evaluate the model outputs, were manually extracted from the reports by a trained doctoral student (human medicine background) under the supervision of an attending pathologist. Annotation was based on the TNM Classification of Malignant Tumors (TNM), Residual tumor (R) classification, Gleason grading system, and WHO tumor classification. Eleven parameters were chosen for extraction based on their clinical and prognostic relevance: WHO (ISUP) Grade Group, T-Stage, N-Stage, Number of lymph nodes examined, Lymph nodes with metastasis, Resection margins, Histologic subtype, Primary/Secondary/Tertiary Gleason pattern, and Percentage of secondary Gleason pattern. These parameters are essential in the staging and characterization of prostate adenocarcinoma as they guide therapeutic decisions and influence patient prognosis.
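For illustration, a structured output covering these eleven parameters could look like the following (a hypothetical example; field names and values are illustrative and do not reproduce the exact schema shown in Fig. 1a):

```json
{
  "WHO (ISUP) Grade Group": "2",
  "T-Stage": "pT2c",
  "N-Stage": "pN0",
  "Number of lymph nodes examined": "12",
  "Lymph nodes with metastasis": "0",
  "Resection margins": "R0",
  "Histologic subtype": "Acinar adenocarcinoma",
  "Primary Gleason pattern": "3",
  "Secondary Gleason pattern": "4",
  "Tertiary Gleason pattern": "Not mentioned",
  "Percentage of secondary Gleason pattern": "30%"
}
```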
The pathology reports were translated from German to English (Fig. 1b) using the DeepL API in September 2023 (www.deepl.com). Apart from anonymization, the reports underwent no additional modifications prior to data extraction.
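As an illustration, a translation call with the official deepl Python client could look like the following (a minimal sketch; the exact client and parameter settings used in the study are assumptions):

```python
import deepl

# anonymized report text (placeholder)
report_text = "Prostatektomiepraeparat mit azinaerem Adenokarzinom ..."

# placeholder key; a real key comes from a DeepL API account
translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")

# translate one report from German to English
result = translator.translate_text(
    report_text, source_lang="DE", target_lang="EN-US"
)
english_report = result.text
```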
All study steps were performed in accordance with the Declaration of Helsinki. This study was approved by the Ethical Committee of the University of Cologne (20-1583). The pathological reports used in this study were obtained from patients who signed a broad informed consent (BioMaSOTA consent form), which is routinely collected from patients undergoing therapies at the University Hospital Cologne. This consent allows the use of patient data and biomaterials for academic research and the development of commercial products, including data transfer within "Germany, European Union, and outside of European Union (so-called third countries)" (§4 of the consent form). The need for additional patient consent was waived as only anonymized, retrospective materials were used. When transmitting data via the API to OpenAI, special attention was paid to ensuring full anonymization. The transmitted data was limited exclusively to the pathological-anatomical report with macroscopic and microscopic description, individual summarized results of molecular pathological examinations, and the final assessment including TNM classification, making it impossible to trace the data back to individual patients. Data protection aspects and risks of using cloud-based tools were extensively discussed within the authors' collective.
The ability of GPT-4 and the 16-bit precision Llama2/Llama3 models (further referred to as full-weight models) to extract structured data from pathological reports was evaluated under various conditions (Fig. 1c) using a dataset of 579 radical prostatectomy pathology reports in English and German versions (see above). Experiments involving GPT-4 utilized the OpenAI API (https://platform.openai.com/). For the implementation of the Llama2 and Llama3 full-weight models, we used Nvidia A100 80 GB GPUs (CUDA version 11.7) hosted on our institute's local infrastructure. The weights for these models were obtained through the Hugging Face API, utilizing the Transformers library in Python (https://github.com/meta-llama/llama3). Model versions and access periods are listed in Table 1. We allocated one A100 GPU for the Llama2 13B and Llama3 8B models and four A100 GPUs for the Llama2 70B and Llama3 70B models.
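For illustration, loading and querying one of the full-weight models with the Transformers library could be sketched as follows (a minimal sketch; the checkpoint identifier and generation settings below are assumptions, and the exact model versions we used are listed in Table 1):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# assumed Hugging Face checkpoint name
model_id = "meta-llama/Meta-Llama-3-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # 16-bit "full-weight" precision
    device_map="auto",          # shards the 70B model across multiple A100 GPUs
)

prompt = "<zero-shot extraction prompt + pathology report>"  # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```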
Interactions with the language models were performed one report at a time using a zero-shot prompting approach (Supplementary Table S1). The models were tasked to extract eleven key parameters from pathology reports and provide their answers in a structured JSON format (Fig. 1a). The responses generated by the models were automatically evaluated at the case level. 239 patient cases had two or more reports: an initial report with preliminary data, including a preliminary TNM classification, and a final report that usually incorporated additional investigations, such as immunohistochemistry, and information from parallel submissions to the department (e.g., lymphadenectomy specimens), resulting in the final TNM classification. In these cases, priority was given to the most recent report for analysis. If certain parameters were not addressed in the most recent report, earlier reports were consulted sequentially back to the earliest available report. If parameters remained specified as "Not mentioned" in the earliest report, this was considered the final answer. To evaluate model performance, accuracy, precision, recall, and F1-Score were calculated. Accuracy was determined as the ratio of correctly predicted instances to the total number of instances. Precision was calculated as the ratio of true positive predictions to the sum of true positive and false positive predictions. Recall was measured as the ratio of true positive predictions to the sum of true positive and false negative predictions. F1-Score was computed as the harmonic mean of precision and recall. True positives are correctly extracted parameters. False positives refer to incorrectly extracted parameters where the model's output does not match any of the ground truth categories for that parameter (e.g., "Pn1" instead of "pN1"). False negatives refer to incorrectly extracted parameters where the model's output exists in the ground truth categories but does not match the specific expected value (e.g., "pN1" instead of "pN3").
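A minimal sketch of this case-level scoring logic, assuming per-parameter dictionaries of predictions and ground truth (function and variable names are hypothetical; the actual evaluation code may differ):

```python
def score_case(predicted, ground_truth, valid_categories):
    """Score one patient case following the definitions above.

    predicted / ground_truth: dicts mapping parameter -> value.
    valid_categories: dict mapping parameter -> set of allowed answer options.
    """
    tp = fp = fn = 0
    for param, truth in ground_truth.items():
        pred = predicted.get(param)
        if pred == truth:
            tp += 1  # true positive: correctly extracted parameter
        elif pred not in valid_categories[param]:
            fp += 1  # false positive: output outside the answer options
        else:
            fn += 1  # false negative: valid option, but not the expected value
    accuracy = tp / len(ground_truth)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1
```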
For detailed error analysis of the full-weight models' outputs, incorrect responses were manually reviewed to identify recurrent error sources.
The frequency of hallucinations was determined by tasking the models with extracting structured data from ten randomly selected, non-malignant, German-language reports using the same zero-shot procedure described above. Since none of these reports contained information on the queried parameters, the only correct answer was "Not mentioned."
To assess the text complexity of the pathology reports, several metrics were calculated: the number of tokens, the type-token ratio (TTR), and the measure of textual lexical diversity (MTLD). These calculations were performed using the koRpus package in R. To increase the precision of the complexity measurements, we removed URLs, punctuation, numbers, English stopwords, and whitespace, and transformed all letters to lowercase. The metrics were then computed for each individual report, and average values were subsequently calculated at the patient level. To explore the relationship between text complexity and data extraction accuracy, these average values were correlated with the percentage of correct answers per patient. Correlation coefficients were calculated using Pearson's r.
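The actual analysis used the koRpus package in R; purely for illustration, the TTR preprocessing and the correlation step could be sketched in Python as follows (variable names and placeholder values are assumptions; MTLD is omitted for brevity):

```python
import re
from scipy.stats import pearsonr

def type_token_ratio(text, stopwords=frozenset()):
    """TTR = unique tokens / total tokens, after the cleaning steps above."""
    text = re.sub(r"https?://\S+", " ", text.lower())     # remove URLs, lowercase
    tokens = [t for t in re.findall(r"[a-zäöüß]+", text)  # drop punctuation/numbers
              if t not in stopwords]                      # drop stopwords
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# per-patient averages (illustrative placeholder values)
mean_ttr = [0.61, 0.58, 0.64, 0.55]
pct_correct = [0.91, 0.88, 0.95, 0.82]
r, p = pearsonr(mean_ttr, pct_correct)  # Pearson's r and its p-value
```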
To employ the 4-bit quantized versions of the Llama2 13B, Llama3 8B, and Qwen2.5 7B models, we used the Ollama platform, executed on a MacBook Pro M1 with 16 GB RAM (https://github.com/ollama). Model versions and access periods are listed in Table 1. Performance evaluation was conducted as for the full-weight models. Additionally, five alternative prompting strategies, selected from the general language domain as the most popular, promising, and easy to implement, were explored during the analysis of the quantized versions of the LLMs. To increase the proportion of correctly structured JSON files, we used the built-in JSON mode of Ollama's OpenAI-compatible API. Again, all interactions with the language models were conducted one report at a time.
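As an illustration, a call to Ollama's OpenAI-compatible endpoint with JSON mode enabled could look like the following (a minimal sketch; the model tag and local port are assumptions):

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API locally; the key is a dummy value
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

prompt = "<zero-shot extraction prompt + pathology report>"  # placeholder

response = client.chat.completions.create(
    model="llama3:8b",  # assumed tag of the 4-bit quantized model
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # built-in JSON mode
)
print(response.choices[0].message.content)
```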
The zero-shot strategy (Supplementary Table S1) included the pathology report to be analyzed, the parameters to be extracted, the answer options, and the output format (JSON file). The few-shot strategy (Supplementary Table S2) is an extension of the zero-shot approach, in which a pathology report and the corresponding JSON output are included as an example to improve performance. For the 'Rephrase' strategy (Supplementary Table S3), we tasked GPT-4 to improve the zero-shot prompt used in the previous analyses. In addition, we included several report/JSON examples for GPT-4. Following this, we refined GPT-4's output by making a single modification before incorporating it into subsequent analyses (see Supplementary Table S3). In the case of the chain-of-thought strategy (Supplementary Table S4), the models were prompted to first break down their response into constituent steps and then summarize them in JSON format. For this purpose, we added the following section at the end of the zero-shot prompt: "Walk me through your answer step by step, summarizing and analyzing each category as we go. Then summarize your answer in JSON format at the end." The chain-of-verification strategy (Supplementary Table S5) was a two-step approach to reduce hallucinations by the models. First, the models were tasked to extract the structured data using the zero-shot strategy. Then, in a second prompt, they were tasked to compare the output containing the JSON object from the first prompt with the original pathology report to verify the answers given. The possible response categories were also listed again for this purpose.
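The two-step chain-of-verification flow could be sketched as follows (ask_llm is a hypothetical wrapper around a single model call; the actual prompt wording is given in Supplementary Tables S1 and S5):

```python
def chain_of_verification(report, ask_llm, zero_shot_template, verify_template):
    """Two-step extraction: draft with zero-shot, then self-verify."""
    # Step 1: zero-shot extraction of the structured JSON object
    draft_json = ask_llm(zero_shot_template.format(report=report))
    # Step 2: the model re-checks the draft against the original report,
    # with the possible response categories listed again
    verified_json = ask_llm(verify_template.format(report=report, draft=draft_json))
    return verified_json
```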
The data processing and statistical analyses were conducted using the R (R Foundation for Statistical Computing, Vienna, Austria) and Python3 programming languages. R was used for the graphical visualization of the results. For further graphical illustrations, Flaticon was used additionally (https://www.flaticon.com).
GPT-4 (OpenAI, https://chatgpt.com) was used via the paid ChatGPT online platform for language refinement as well as spelling and grammar corrections of the manuscript.
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.