Large language models (LLMs) hold great promise for summarizing medical evidence. Most recent studies focus on applying proprietary LLMs, which introduces multiple risks, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short of that of proprietary models. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance. Using a benchmark dataset, MedReview, consisting of 8161 pairs of systematic reviews and summaries, we fine-tuned three broadly used open-source LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the performance of all open-source models improved after fine-tuning. The performance of fine-tuned LongT5 is close to that of GPT-3.5 under zero-shot settings. Furthermore, smaller fine-tuned models sometimes even demonstrated superior performance compared to larger zero-shot models. These trends of improvement were manifested in both a human evaluation and a larger-scale GPT-4-simulated evaluation.
Medical evidence plays a critical role in healthcare decision-making. In particular, systematic reviews and meta-analyses of randomized controlled trials (RCTs) are considered the gold standard for generating robust medical evidence. However, systematically reviewing multiple RCTs is laborious and time-consuming. It requires retrieving relevant studies, appraising the evidence quality, and synthesizing findings. Meanwhile, systematic reviews frequently become obsolete upon publication, primarily due to protracted review processes. The delay is exacerbated by the exponential increase in scientific discoveries, exemplified by over 133,000 new clinical trials registered at ClinicalTrials.gov since 2020. As such, it is imperative to establish an efficient, reliable, and scalable automated system to streamline and accelerate systematic reviews.
Typically, systematic reviews include quantitative and qualitative reports, the former being a statistical meta-analysis of relevant clinical trials, and the latter being a concise narrative explanation of the quantitative results. Language generation technologies could be employed to auto-generate such narratives, yet they have not been widely applied to medical evidence summarization. Text summarization has attracted research attention for decades. Earlier techniques relied on extracting key phrases and sentences through rules or statistical heuristics such as word frequency and sentence placement. These methods, however, struggled to comprehend context and generate cohesive summaries. A significant shift occurred with the adoption of neural network-based methods, enhanced by attention mechanisms. These mechanisms enable a model to attend to different input segments and capture long-range dependencies between text elements. This advancement allows for a deeper grasp of context, leading to more fluent and precise summaries.
Recent advancements in generative Artificial Intelligence, notably large language models (LLMs), have shown tremendous potential in comprehending and generating natural language. While these generalist models perform well across diverse tasks, they fail to capture in-depth domain-specific knowledge, particularly in biomedicine. Furthermore, despite their superior performance relative to open-source alternatives, the disadvantages of closed-source models are rarely discussed or even mentioned. The lack of transparency of closed-source models makes it challenging to understand model behavior and troubleshoot customized variants. Moreover, reliance on closed-source models increases the risks associated with changes in service terms or discontinuation of services, which pose a critical threat to long-term projects. Open-source models provide a promising solution to mitigate these risks.
While multiple open-source model architectures have been proposed, either as general-purpose foundation models or specifically for text summarization, few have been optimized to synthesize medical evidence. Gutierrez et al. compared the in-context learning performance of GPT-3 with fine-tuned BERT models on named-entity recognition tasks in the biomedical domain. Tang et al. assessed zero-shot GPT-3.5 and ChatGPT on summarizing Cochrane review abstracts. These related studies, however, did not focus on fine-tuning open-source models for medical evidence summarization. It is also unclear to what degree optimization strategies, e.g., few-shot learning or fine-tuning, can help bridge the performance gap between open-source models and cutting-edge closed-source alternatives. To quantitatively assess how fine-tuning can enhance open-source LLMs for medical evidence summarization, we experimented with three broadly used open-source LLMs: PRIMERA, LongT5, and Llama-2, spanning both architectures designed specifically for text summarization and generalist foundation models. Fine-tuning these models is challenging because of the substantial computational resources required and the risk of catastrophic forgetting, a phenomenon in which a model's performance degrades on tasks on which it initially performed well. To address these issues, we employed low-rank adaptation (LoRA), a parameter-efficient fine-tuning method that updates only a small fraction of the model's parameters during fine-tuning.
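To make the fine-tuning setup concrete, the snippet below is a minimal sketch of LoRA adaptation using the Hugging Face transformers and peft libraries. The checkpoint name, rank, target modules, and other hyperparameters are illustrative assumptions, not the exact configuration used in this study.

```python
# Minimal LoRA fine-tuning sketch (illustrative; hyperparameters are assumptions,
# not the exact configuration used in this study).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint for illustration
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA injects small trainable low-rank matrices into selected weight matrices;
# the base weights remain frozen, which reduces compute and memory requirements
# and helps mitigate catastrophic forgetting.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank of the low-rank update (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumed)
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

The wrapped model can then be trained with a standard training loop or trainer; only the LoRA parameters receive gradient updates.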
To facilitate future studies on leveraging LLMs for medical evidence summarization, we present a benchmark dataset, MedReview, consisting of 8161 pairs of meta-analysis results and narrative summaries from the Cochrane Library, published across 37 topics between April 1996 and June 2023. The dataset is divided into training, validation, and test sets (see details in the "Method" section and Supplementary Table 1). This collection extends our previous study evaluating LLMs for evidence summarization and covers a wider range of specialties and writing styles, highlighting common text summarization challenges (Fig. 1a).
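As a concrete illustration of how MedReview-style pairs could be organized for fine-tuning, the sketch below assumes a simple JSONL layout with hypothetical field names ("meta_analysis" and "summary"); the actual release format of the dataset may differ.

```python
# Hypothetical sketch of loading meta-analysis/summary pairs for fine-tuning.
# The JSONL layout and field names ("meta_analysis", "summary") are assumptions,
# not the dataset's official schema.
import json

def load_pairs(path):
    """Yield (source, target) pairs from a JSONL file of review/summary records."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["meta_analysis"], record["summary"]

if __name__ == "__main__":
    train_pairs = list(load_pairs("medreview_train.jsonl"))  # assumed file name
    print(f"Loaded {len(train_pairs)} training pairs")
```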