However, two notable challenges limit the immediate utility of TCPA. First, the previous RPPA data have limited coverage of protein markers (only ~200). Second, the data portal provides only a few predefined analytic modules, with little flexibility for user-defined analyses. To address these challenges, we recently expanded our RPPA protein panel to approximately 500 high-quality antibodies10. This expansion has enabled the development of a comprehensive, high-quality pan-cancer functional proteomics compendium, termed RPPA500, integrating data from both TCGA and CCLE samples. Alongside this expanded proteomic dataset, here we introduce DrBioRight 2.0 (https://drbioright.org), a cutting-edge chatbot powered by large language models (LLMs). This tool is designed to lower technical barriers, enabling seamless analysis of complex omics data: users with diverse backgrounds can easily access, analyze, and visualize data through intuitive natural language queries.
Using a well-established data processing pipeline and following community guidelines, our RPPA500 compendium encompasses a total of 9000 samples, comprising both patient tumor and cancer cell line samples. The TCGA cohort dataset includes protein expression profiles from 7828 patient tumors across 32 distinct cancer types (Fig. 1). Predominant tissue types in this dataset include breast (BRCA, n = 881), kidney (KIRC/KIRP/KICH, n = 756), and lung (LUAD/LUSC, n = 693). The CCLE cohort dataset covers 878 cancer cell lines, with the lung, blood, lymphocyte, and colorectal lineages each represented by more than 50 distinct cell lines (Fig. 1). Most of these cell lines have parallel functional data, such as gene dependency, metastatic potential, and drug sensitivity data. The final RPPA500 protein set contains 447 protein markers, including 357 total proteins and 90 post-translationally modified (PTM) proteins (e.g., phosphorylated proteins), and is highly enriched in therapeutic targets and biomarkers (Supplementary Data 1). To underscore the expanded coverage of cancer-related pathways, we aligned the protein markers with the hallmark gene sets. Our RPPA500 protein panel covers all 50 hallmark gene sets (Supplementary Fig. 1), including robust coverage of apoptosis (n = 43), PI3K-Akt-mTOR signaling (n = 34), estrogen response (n = 32), hypoxia (n = 31), IL6-JAK-STAT3 signaling (n = 31), apical junction (n = 29), interferon response (n = 26), EMT (n = 18), G2M checkpoint (n = 18), the P53 pathway (n = 17), KRAS signaling (n = 12), and DNA repair (n = 7). Compared with our previous protein panel, the number of total proteins increased by 115% and the number of PTM proteins by 67% across these gene sets, highlighting a substantially increased capacity to interrogate cancer biology at the protein level.
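Conceptually, aligning the panel with hallmark gene sets reduces to set intersection between the panel's target genes and each gene set. The following is a minimal sketch of this counting step; the gene lists shown are tiny illustrative placeholders rather than the real MSigDB hallmark sets, and phospho-markers are simplified to their parent gene symbols:

```python
# Toy hallmark gene sets (placeholders; the real sets contain ~30-200 genes each).
hallmark_sets = {
    "HALLMARK_APOPTOSIS": {"BAX", "BCL2", "CASP3", "TP53"},
    "HALLMARK_PI3K_AKT_MTOR_SIGNALING": {"AKT1", "AKT2", "MTOR", "PTEN"},
}

# Genes targeted by the antibody panel (phospho-markers mapped to parent genes).
panel_genes = {"BAX", "BCL2", "AKT2", "MTOR", "EGFR"}

# Per-set coverage is the size of the intersection with the panel.
coverage = {name: len(genes & panel_genes) for name, genes in hallmark_sets.items()}
```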
Recent breakthroughs in LLM-based generative AI have ushered in a transformative era for data analytics. In this study, we developed a new LLM-based chatbot, DrBioRight 2.0, empowered with natural language processing, enabling users to explore, analyze, and visualize the above RPPA data intuitively and intelligently (Fig. 1). Specifically, we first generated a unified multi-omics dataset by standardizing and normalizing patient clinical data, molecular profiling data at the DNA, RNA, and RPPA500-based protein levels, and cell line phenotypic datasets. Collectively, over 1 billion data values were curated and restructured in HDF5 format within a NoSQL database hosted on an I/O-efficient cloud-based server. To address the long-standing challenge of non-standard protein annotation, we thoroughly reviewed the protein markers and cross-referenced them with external databases to comprehensively annotate proteins at the individual, pathway, functional, and disease levels. This detailed annotation facilitates user-friendly, biologically driven analysis of the data. DrBioRight has several features that are not available in conventional analytics platforms, including natural language understanding, transparency and reproducibility, and user-friendliness. These features are supported by several key cutting-edge techniques: (i) Chat UI: a real-time conversational chat interface; (ii) Prompts: highly customizable, LLM-oriented, domain-knowledge-specific prompts; (iii) LLMs: LLM-empowered generative AI; (iv) Code generation: a seamless code-generation-correction cycle; (v) Plugins: deeply nested interactive plugins that provide a unique suite of tools for effective data visualization and analysis, such as interactive clustering heatmaps.
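To illustrate the HDF5-based restructuring described above, the sketch below lays out a small protein-by-sample matrix in hierarchical HDF5 groups using h5py. The group layout, file name, and attribute names are assumptions for illustration only, not the actual DrBioRight schema:

```python
import numpy as np
import h5py

# A tiny 3-marker x 2-sample RPPA matrix (illustrative values).
rppa = np.array([[0.1, 0.5],
                 [1.2, -0.3],
                 [0.0, 0.7]])
markers = np.array(["AKT2_pS474", "EGFR", "TP53"], dtype="S")

with h5py.File("rppa500_demo.h5", "w") as f:
    # Hypothetical cohort/disease hierarchy: one group per TCGA cancer type.
    grp = f.create_group("TCGA/LUAD")
    grp.create_dataset("protein_expression", data=rppa, compression="gzip")
    grp.create_dataset("markers", data=markers)
    grp.attrs["n_samples"] = rppa.shape[1]
```

Storing each cohort as its own group lets the back end read only the slice a query needs rather than loading the full compendium into memory.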
To demonstrate its utility, we present an illustrative example where users can simply query, "Please generate a heatmap for protein expression data of the current dataset." In response, DrBioRight dynamically processes the data and calls the corresponding heatmap plugin to generate an interactive heatmap (Fig. 2A). Like the other interactive plugins we have implemented, the heatmap plugin can efficiently handle large datasets. It offers a comprehensive global overview along with numerous features (such as selection, zoom in/out, searching, 2D/3D scatter plots, pathway mapping, and links to external resources) to facilitate effective data exploration. For a more detailed analysis, users can further ask, "Could you please show me the correlation between AKT2PS474 and IL6 expression?" DrBioRight then extracts the data, performs the corresponding statistical analysis, and presents the results in a clear scatter plot. Leveraging the same dataset, users can conduct a survival analysis by inquiring about the correlation between a protein and patient survival time, followed by visualization through Kaplan-Meier plots. In contrast to the previous analytic modules in TCPA, DrBioRight distinguishes itself by offering versatile analyses, including customizable interactions with the chatbot. For instance, after performing a survival analysis across all the samples in the full cohort, users can further investigate specific associations within male or female patients or change the colors in a plot. Another noteworthy feature of DrBioRight is its seamless transition between analytics-driven and general questions. As depicted in Fig. 2A, users can ask the chatbot to summarize the results. Moreover, DrBioRight allows users to download the corresponding project report as an R Markdown file and run it locally in RStudio to reproduce the analysis (Supplementary Fig. 2A).
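Under the hood, a correlation query of this kind reduces to computing a correlation coefficient between two marker vectors before plotting. A from-scratch sketch of the Pearson statistic is shown below for illustration only; DrBioRight generates its own analysis code at runtime, so this function is not part of the platform:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors
    (an illustrative sketch of the statistic behind a scatter-plot query)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```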
These features collectively position DrBioRight as a highly convenient analytic tool, providing unparalleled flexibility and customization in data analysis.
The system architecture of DrBioRight 2.0 comprises three integral components: (i) a NoSQL database, (ii) a back-end LLM-powered analytics module, and (iii) an interactive chat interface (Fig. 2B). To start an analysis, a user begins by selecting a disease (e.g., lung adenocarcinoma [LUAD]). The chatbot then automatically links relevant multi-omics data to the user's project space, making it ready for querying and analysis. The back-end LLMs predict the user's intent, distinguishing between general inquiries and questions requiring code generation or bioinformatics analysis. DrBioRight outputs a logical flow based on a chain-of-thought approach to enhance user understanding. In the back end, LLMs generate text-based answers or programming scripts on the fly. Before submission to the job queue, the platform reviews and validates the generated code, autonomously correcting common errors such as missing libraries or incompatible package versions. Following successful result generation, the user-friendly chat interface displays the outcomes. For ongoing improvement, we integrate a rating function that allows users to evaluate analytic results; the user feedback, together with expert manual evaluations, then guides iterative refinements to fine-tune the LLMs through reinforcement learning from human feedback (RLHF).
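The review-and-correct step described above can be sketched as a small retry loop that executes generated code and patches a common failure mode, a missing import, before resubmitting. This is a hypothetical simplification; `run_with_autocorrect` and its repair rule are illustrative assumptions, not DrBioRight's actual implementation:

```python
import re

def run_with_autocorrect(code: str, max_attempts: int = 3) -> dict:
    """Execute generated code; on a missing-name error, prepend the likely
    missing import and retry (a toy model of the code-correction cycle)."""
    ns = {}
    for _ in range(max_attempts):
        try:
            exec(code, ns)
            return ns  # success: return the resulting namespace
        except NameError as e:
            # e.g. "name 'math' is not defined" -> prepend "import math"
            m = re.search(r"name '(\w+)' is not defined", str(e))
            if m is None:
                raise
            code = f"import {m.group(1)}\n" + code
            ns = {}
    raise RuntimeError("could not repair generated code")
```

For example, `run_with_autocorrect("result = math.sqrt(16)")` succeeds on the second attempt after the missing `import math` is injected.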
To maximize the performance of DrBioRight 2.0, we have implemented cutting-edge techniques to enhance the LLMs (Fig. 3A). Overall, we incorporated a multi-agent workflow to build hierarchical agent teams using a graph architecture (Supplementary Fig. 2B). This framework better organizes the multi-agent system and streamlines the development process (Methods). Each team consists of one or more agents or tools. For example, the multi-omics data analysis team uses a heatmap tool to provide a dataset overview and a survival analysis tool to link proteins with patient survival data; a correlation analysis tool performs association analyses between features, including protein expression, mutations, and clinical variables. A supervisor routes team-specific questions to the appropriate tools for task execution and analytic results. Each agent is powered by a model coupled with task-specific prompts, which include a mini knowledge base on our RPPA500 data, a summary of our metadata, and general analysis information. To fine-tune the LLMs, we curated and standardized thousands of user queries through expert review, creating both training and test datasets. Using the training dataset, we performed model fine-tuning in three steps: (i) initial supervised fine-tuning, in which the base model was fine-tuned on prompt-response pairs to learn domain-specific contexts; (ii) reward model training, in which we developed an evaluation system that allows domain experts to rank the AI responses (Supplementary Fig. 3), and the resulting evaluation datasets were used to train a reward model; and (iii) optimization, performed with the PPO (proximal policy optimization) trainer from Hugging Face. To evaluate its performance, we tested our platform using an independent test set of queries not used in the fine-tuning process. Only 26% of the questions could be addressed by our classic TCPA platform (Fig. 3B), highlighting a major need for a versatile and customizable tool for such analyses. We then tested the same questions using GPT-4 and achieved a 58% success rate, underscoring the limitations of a general LLM in addressing domain-specific questions through natural language-based data analytics. However, when employing the fine-tuned models under the graph-based workflow using LangGraph on the same set of questions, we achieved an impressive 90% success rate (Methods). This result emphasizes the impact of incorporating domain-specific knowledge, the fine-tuning process, and the multi-agent workflow.
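The supervisor-based routing described above can be sketched as a dispatch table that sends each question to the matching analysis tool. This keyword-matching toy stands in for the LLM intent prediction that the real system performs, and every name and return value here is an illustrative assumption:

```python
from typing import Callable, Dict

# Hypothetical tool registry for one agent team; real tools would run an
# analysis and return a plot or table rather than a string.
TOOLS: Dict[str, Callable[[str], str]] = {
    "heatmap": lambda q: "rendered heatmap",
    "survival": lambda q: "Kaplan-Meier result",
    "correlation": lambda q: "correlation result",
}

# Simple keyword triggers in place of LLM-based intent classification.
KEYWORDS = {"heatmap": "heatmap", "survival": "survival", "correlat": "correlation"}

def supervisor_route(question: str) -> str:
    """Route a user question to the appropriate tool, falling back to a
    general text answer when no analysis intent is detected."""
    q = question.lower()
    for kw, tool in KEYWORDS.items():
        if kw in q:
            return TOOLS[tool](question)
    return "general answer"
```

In the actual graph-based workflow, the supervisor node would be an LLM call and each tool a subgraph, but the routing contract is the same: classify the question, then delegate execution.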