Another advantage of LLMs over CAD models is that their extensive and robust medical knowledge can be leveraged to provide interactive explanations and medical advice, as we illustrate in Fig. 1b. For example, based on an image and the generated report, patients can inquire about appropriate treatment options (left panel) or ask for definitions of medical terms such as "airspace consolidation" (middle panel). Given the patient's chief complaint (right panel), LLMs can also explain why such a symptom occurs. In this manner, patients can gain a deeper understanding of their symptoms, diagnosis, and treatment more efficiently, which can help reduce the cost of consultations with clinical experts. As the performance of CAD models and LLMs continues to improve, and as these models become jointly trainable in the future, the proposed scheme has the potential to improve the quality of radiology reports and enhance the feasibility of online healthcare services.
In this paper, we evaluate the performance of the combination of a report generation network (R2GenCMN) and a classification network (PCAM), and compare the result to the baselines R2GenCMN, CvT2DistilGPT2, and PCAM. On the basis of clinical importance and prevalence, we focus on five kinds of observations. Three metrics, namely precision (PR), recall (RC), and F1-score (F1), are reported in Table 1.
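The per-observation metrics above follow the standard binary definitions. As a minimal sketch (the label lists are toy data, not the actual MIMIC-CXR evaluation), PR, RC, and F1 for one observation label can be computed as:

```python
# Sketch: computing precision (PR), recall (RC), and F1-score (F1) for one
# observation label, as reported in Table 1. The toy predictions below are
# illustrative only.

def pr_rc_f1(y_true, y_pred):
    """Binary precision, recall, and F1 for one observation label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    pr = tp / (tp + fp) if tp + fp else 0.0
    rc = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f1

# Toy example: ground-truth labels vs. labels extracted from generated reports.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
pr, rc, f1 = pr_rc_f1(y_true, y_pred)
print(f"PR={pr:.3f} RC={rc:.3f} F1={f1:.3f}")
```

A high PR with low RC, as discussed below for R2GenCMN, corresponds to a model that rarely reports positives but is usually right when it does; F1 balances the two.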
The strength of our method is clearly shown in Table 1. It has clear advantages in RC and F1, and is only weaker than R2GenCMN in terms of PR. Our method achieves relatively high recall and F1-score on the MIMIC-CXR dataset. For all five diseases, both CvT2DistilGPT2 and R2GenCMN show inferior performance to our method in RC and F1. In particular, their performance on Edema and Consolidation is rather low: their RC values on Edema are 0.468 and 0.252, respectively, while our method achieves an RC of 0.626 based on GPT-3. The same phenomenon can be observed for Consolidation, where the two methods reach only 0.239 and 0.121, while ours (GPT-3) drastically outperforms them with an RC of 0.803. R2GenCMN has a higher PR than our method on three of the five diseases. However, the cost of R2GenCMN's high precision is its weakness in the other two metrics, which can lead to biased report generation, e.g., rarely reporting any potential disease. At the same time, our method has the highest F1 among all methods, and we believe it can serve as the most trustworthy report generator. Another strength of our method lies in its scaling performance.
It is worth noting that our proposed ChatCAD framework significantly outperforms both R2GenCMN and PCAM. This superior performance can be attributed to ChatGPT's advanced reasoning capabilities, which effectively synthesize information from multiple sources to produce a more comprehensive and accurate report. We believe this phenomenon further underscores the superiority of ChatCAD and demonstrates its considerable potential for clinical applications. The results can also be explained from the perspective of continual learning. Unlike R2GenCMN and PCAM, which are trained solely on the MIMIC-CXR and CheXpert datasets, respectively, ChatCAD benefits from sequential learning on MIMIC-CXR, CheXpert, and the additional datasets used to train the large language model (as shown in Table S1 in Supplementary). The large language model acts as a general interface, integrating knowledge from these diverse datasets while avoiding catastrophic forgetting. In summary, the improvements in accuracy of ChatCAD over the baselines can be attributed to both the enhanced methodology and the broader access to training data.
The process of ChatCAD, as shown in Fig. 1a, is a straightforward procedure consisting of the following steps. First, examination images, such as X-rays, are fed into pre-trained CAD models to obtain results. These results, typically in tensor format, are then transformed into natural language. Next, language models are employed to summarize the findings and establish a conclusion. Finally, the results obtained from the CAD models are utilized to facilitate a conversation regarding symptoms, diagnosis, and treatment. To investigate the impact of prompt design on report generation, we have developed several prompts, which are depicted in Fig. 3.
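The steps above can be sketched as follows. The score-to-text rule and the prompt wording here are illustrative assumptions, not the actual implementation or the prompts of Fig. 3:

```python
# A minimal sketch of the ChatCAD pipeline in Fig. 1a. The CAD model output,
# the score-to-text rule, and the prompt wording are illustrative assumptions.

def scores_to_text(scores):
    """Step 2: translate tensor-like CAD outputs into natural language."""
    phrases = []
    for disease, p in scores.items():
        if p > 0.5:
            phrases.append(f"Network A finds {disease} with probability {p:.2f}")
        else:
            phrases.append(f"Network A finds no sign of {disease} (probability {p:.2f})")
    return ". ".join(phrases) + "."

def build_prompt(scores):
    """Step 3: ask the language model to summarize the findings into a report."""
    return (scores_to_text(scores)
            + " Write a concise radiology report summarizing these findings.")

# Step 1 would run a pre-trained CAD model on the X-ray image; here we use
# fixed probabilities for illustration.
cad_scores = {"edema": 0.83, "consolidation": 0.12}
prompt = build_prompt(cad_scores)
print(prompt)
```

The resulting prompt would then be sent to the LLM, whose answer also seeds the follow-up conversation of step 4.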
Reports generated from Prompt #2 and Prompt #3 are generally acceptable and reasonable in most cases, as one can observe in Fig. S1 and Fig. S2 in Supplementary. "Network A" is frequently referenced in the generated reports. Prompt tricks, e.g., "Revise the report based on results from Network A but without mentioning Network A", can be applied to remove such mentions, but we do not utilize them in the current experiments.
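The revision trick mentioned above amounts to a follow-up prompt appended to the model's own draft; the chat loop itself is stubbed out in this hypothetical sketch:

```python
# Sketch of the "Network A" removal trick: a follow-up prompt asking the
# model to revise its own draft. The revision wording is the example from
# the text; sending it to an actual LLM is omitted.

REVISE = ("Revise the report based on results from Network A "
          "but without mentioning Network A.")

def revision_prompt(draft_report):
    """Build the second-turn prompt from a first-draft report."""
    return f"{draft_report}\n\n{REVISE}"

print(revision_prompt("Network A suggests airspace consolidation."))
```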
Different from ChatGPT, which can only be accessed via online requests, language models such as LLaMA can be used and fine-tuned on local machines without data privacy issues. To evaluate the generalizability of ChatCAD and to validate its potential value in clinical practice, we have experimented with a range of LLMs, including LLaMA-1, LLaMA-2, and several others. The results are presented in Table 2, which compares the F1-scores of different LLMs, including general-purpose models, specialized medical models, and OpenAI's GPT variants. As indicated in the table, there are notable variations in performance across conditions and model architectures, providing valuable insights into the suitability of each model for the ChatCAD framework. It is noteworthy that GPT-3 (175B) does not achieve the best macro-average F1-score, which means that a smaller LLM such as LLaMA-2 (13B) is capable enough to assist the diagnostic process within our proposed ChatCAD.
Since GPT models are continuously updated, we also demonstrate here the evolving capabilities of LLMs within the ChatCAD framework. We include the latest available versions, namely GPT-3.5 Turbo and GPT-4, released in November 2023. The results of ChatCAD using different GPT models, denoted by model generation and release date, are presented at the bottom of Table 2.
Although the F1-scores for the latest GPT-3.5 Turbo model suggest a slight decrease in performance on average compared to its larger predecessors, it is still comparable to the best open-source model (LLaMA-2, as shown in Table 2) and offers several practical advantages: it is smaller, costs less, and responds faster. GPT-3.5 Turbo's lower F1-scores relative to its larger GPT-3 and GPT-3.5 counterparts can be attributed to its design optimization for speed and cost-effectiveness. These optimizations involve a reduced parameter count, which may curtail the model's capacity to process detailed information such as medical data. Furthermore, the model's tuning may favor responsiveness over the specialized depth needed for medical report generation. Despite this, GPT-3.5 Turbo remains a viable option for applications where efficiency and affordability take precedence, and the trade-off in performance may be acceptable in certain real-world scenarios.
In the case of GPT-4, our experiments have indicated a noticeable enhancement in performance compared to all previous models, including the GPT-3 family. This improvement may stem from several advancements.
In a clinical setting, more aspects than the above-mentioned classification metrics need to be evaluated. We have therefore carefully developed an experimental pipeline to evaluate clinical reports generated by our proposed ChatCAD from two perspectives: conciseness and appropriateness. Conciseness is vital to ensure that the report is succinct and focused, avoiding extraneous details that may detract from the primary clinical message. Appropriateness measures whether the content is relevant and clinically pertinent to the case at hand. These aspects are crucial for clinicians, who rely on precise and targeted information to make informed decisions quickly.
Incorporating the experimental pipeline demonstrated in Supplementary Information into our study design (Fig. S3), we structured an experiment in which each clinical expert is asked to evaluate 100 individual cases. These cases are constructed from the MIMIC-CXR dataset, with each image paired with two types of reports: one generated by ChatCAD and another authored by a radiologist. The reports, coupled with their respective images, are merged and shuffled to ensure that each expert's assessment is unbiased and based solely on the quality of the reports with respect to the medical images. We instituted a 5-point Likert scale (as demonstrated in Fig. S4 in Supplementary) to quantify the evaluations systematically. The scale ranges from 1 (significantly lacking), through 2 (needs improvement), 3 (adequate), and 4 (above average), to 5 (exemplary), allowing experts to provide a nuanced assessment of each report's conciseness and appropriateness. The experts offer both quantitative ratings and qualitative feedback for each report.
The experimental results of an experienced radiologist are selected and displayed in Fig. 4. From the perspective of report conciseness, there remains a significant gap between the diagnostic reports generated by AI and those written by real doctors. Among the 50 generated reports, 33 received ratings of 3 or below, while 17 received a rating of 4, indicating that the majority of AI-generated reports still lack fluency. In contrast, the fluency of the real reports is notably higher, with more reports receiving a rating of 4. Regarding appropriateness, ChatCAD demonstrated surprisingly impressive performance. From Fig. 4a, b, we can observe that the vast majority of AI-generated reports (39) received a rating of 4, even more than the number of real reports (32). This highlights the advantage of the proposed ChatCAD in terms of report generation. In terms of conciseness, ChatCAD-generated reports scored 3.40 ± 0.67, while human-written reports obtained 3.48 ± 0.58. ChatCAD demonstrates impressive performance on appropriateness (3.84 ± 0.65), surpassing human-written reports (3.58 ± 0.64).
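Summary scores such as 3.40 ± 0.67 are the mean and standard deviation of the per-report Likert ratings. A minimal sketch, using toy ratings rather than the actual expert evaluations:

```python
# Sketch of how the summary scores (e.g., 3.40 ± 0.67) are obtained from
# the 5-point Likert ratings. The rating list below is toy data for
# illustration, not the actual expert evaluations.
from statistics import mean, stdev

ai_conciseness = [3, 4, 3, 4, 3, 2, 4, 3, 4, 3]  # illustrative ratings
summary = f"{mean(ai_conciseness):.2f} ± {stdev(ai_conciseness):.2f}"
print(summary)
```

Note that `stdev` is the sample standard deviation; whether the sample or population form is used should match the reporting convention of the study.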
We also demonstrate the results of the identification task in Fig. 4c, f. Two subjects with different levels of exposure to AI techniques were asked to discriminate AI-generated reports from the samples presented to them. The subject with less exposure to AI showed notable difficulty in distinguishing AI-generated reports, achieving only 55% accuracy, which suggests a lower capability in discerning between human- and AI-generated content compared to those more familiar with AI technology. In contrast, the subject with more experience in AI achieved 73% accuracy, showing a clearer ability to discriminate between human- and AI-generated reports. The precision, recall, and F1-scores were notably higher as well, indicating a more robust capacity for differentiating between the two sources. This is further evidenced by the visualization in Fig. 4c, revealing the potential of AI-generated reports in practical clinical scenarios.
In summary, our experimental evaluation, as shown in Fig. 4, provides quantitative data on the conciseness and appropriateness of ChatCAD-generated reports compared to human-authored ones. While AI-generated reports may lack a degree of the linguistic fluidity typically found in human reports (evidenced by a lower conciseness score), they demonstrate a high degree of appropriateness (p = 0.022, paired t-test). Remarkably, the AI-generated reports received higher appropriateness scores than human-written reports in a significant number of cases.
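The paired t-test used here compares the two ratings given to the same case. As a minimal sketch with toy scores (in practice one would call `scipy.stats.ttest_rel`, which also returns the p-value):

```python
# Sketch of the paired t-test behind the appropriateness comparison
# (p = 0.022 in the text). The score lists are toy data; the p-value
# lookup from the t distribution is omitted.
import math
from statistics import mean, stdev

def paired_t(a, b):
    """t statistic for paired samples: mean difference over its standard error."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

ai    = [4, 4, 3, 5, 4, 4, 3, 4]  # illustrative appropriateness ratings
human = [4, 3, 3, 4, 4, 3, 4, 3]
print(f"t = {paired_t(ai, human):.2f}")
```

Pairing per case removes between-case variability, which is why it is the appropriate test when both reports are rated on the same image.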
This evidence suggests that AI-generated reports, with their traceability and consistency, could complement the work of human radiologists, potentially mitigating issues related to experience variability, stress, and fatigue. The integration of AI in radiological reporting could thus not only augment the radiologist's capabilities but also introduce an element of standardization and reliability that is less susceptible to human factors.
In this section, we compare the performance of different LLMs for report generation. OpenAI provides four different sizes of GPT-3 models through its publicly accessible API: text-ada-001, text-babbage-001 (1.3 billion parameters), text-curie-001 (6.7 billion parameters), and text-davinci-003 (175 billion parameters). The smallest, text-ada-001, cannot generate meaningful reports and is therefore not included in this experiment. We report the F1-score of all observations in Fig. 2b. It is noteworthy that language models struggle to perform well in clinical tasks when their model size is limited. The diagnostic performances of text-babbage-001 and text-curie-001 are subpar, as demonstrated by their low average F1-scores over the five observations compared with the last two models. The improvement in diagnostic performance is evident for text-davinci-003, whose model size is hundreds of times larger than that of text-babbage-001: on average, the F1-score improves from 0.471 to 0.591. ChatGPT is slightly better than text-davinci-003, with an improvement of 0.014, so their diagnostic abilities are comparable. The details can be observed in Table 3. Overall, the diagnostic capability of language models scales with their size, highlighting the critical role of the logical reasoning capability of LLMs. In our experiments, more capable models generally produce longer reports, as shown in Fig. 2c. At the same time, nearly 40% of the reports generated by text-babbage-001 and nearly 15% of those from text-curie-001 have no meaningful content.
A major advantage of our approach is the utilization of the LLM to combine decisions from multiple CAD models. This allows us to fine-tune each CAD model individually and ensemble them incrementally. For instance (cf. Fig. 5a), in response to an emergency outbreak such as COVID-19, we can add a pneumonia classification model that differentiates between community-acquired pneumonia and COVID-19 infection. This process requires very little data and is thus very flexible; for example, one prior model was trained on only 204 COVID-19 cases and reached 90% diagnostic accuracy. The final report will then incorporate the findings of the newly added model, demonstrating the effectiveness of our approach in improving the overall accuracy and reliability of CAD systems, as well as its potential for rapid adaptation to emerging situations such as disease outbreaks. By leveraging the LLM, we can seamlessly integrate new models and adjust the weighting of each model to achieve optimal performance.
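Incremental ensembling of this kind reduces, at the prompt level, to appending the new model's verbalized findings before asking the LLM to merge them. A minimal sketch, where the network names and outputs are illustrative assumptions rather than the actual models:

```python
# Sketch of incrementally ensembling CAD models via the LLM prompt, as in
# Fig. 5a: a newly added pneumonia classifier contributes its findings
# without retraining the existing models. Names and outputs are illustrative.

def describe(model_name, findings):
    """Verbalize one CAD model's findings as a sentence."""
    return f"{model_name} reports: " + "; ".join(findings) + "."

# Existing chest-disease classifier plus a newly added COVID-19 model.
base = describe("Network A", ["edema probability 0.83"])
covid = describe("Network B", ["community-acquired pneumonia unlikely",
                               "COVID-19 infection probability 0.74"])

prompt = (f"{base} {covid} "
          "Combine these findings into one coherent radiology report.")
print(prompt)
```

Because each model only contributes text, adding or removing a model never requires retraining the others; the LLM resolves any overlap between their findings.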
The proposed ChatCAD also offers several benefits, including its ability to utilize LLM's extensive and reliable medical knowledge to provide interactive explanations and advice. As shown in Fig. 5e, f, two examples of the interactive CAD are provided, with one chat discussing pleural effusion and the other addressing edema and its relationship to swelling.
Through this approach, patients can gain a clearer understanding of their symptoms, diagnosis, and treatment options, leading to more efficient and cost-effective consultations with medical experts. As language models continue to advance and become more accurate with access to more trustworthy medical training data, ChatCAD has the potential to significantly enhance the quality of online healthcare services.