3 Sources
[1]
Multimodal learning enables chat-based exploration of single-cell data - Nature Biotechnology
The CellWhisperer multimodal AI connects transcriptomes and text We present CellWhisperer, a multimodal AI that enables interactive scRNA-seq data exploration with natural-language conversations. Our method was created in three steps (Fig. 1a): (1) LLM-assisted curation of multimodal training data, resulting in 1,082,413 pairs of human RNA-seq profiles and matched textual annotations; (2) training of the CellWhisperer embedding model, which places the transcriptomes and their AI-curated textual descriptions into a joint embedding space for cell search and annotation; and (3) development of the CellWhisperer chat model for transcriptome-aware question answering and natural-language chats. This section summarizes each of these three steps, while further technical details are provided in the Methods, Supplementary Notes 2 and 3 and Extended Data Fig. 1. First, we created a large training dataset of transcriptomes (including bulk RNA-seq profiles and scRNA-seq derived pseudo-bulk profiles) with concise textual annotations (such as 'Renal cell carcinoma tissue sample taken from a male individual at stage 2, with no metastasis, preserved in formalin-fixed paraffin-embedded blocks') across the wide range of cell types and conditions captured by GEO and CELLxGENE Census. GEO comprises human RNA-seq data from more than 20,000 individual studies based on researcher submissions, which provides tremendous thematic breadth but also a need for data harmonization. We used the ARCHS4 uniform reprocessing of GEO data and developed an LLM-assisted curation procedure to create concise, coherent and biologically informative textual annotations for each sample based on sample-specific metadata provided by GEO (which includes cell types, organs, tissues, diseases, experimental methods and scientific project abstracts). LLM prompts and illustrative results are shown in Supplementary Note 2. 
This AI-assisted data curation yielded a standardized dataset of 705,430 human transcriptomes with matched textual annotations. We also derived pseudo-bulk transcriptomes from several hundred scRNA-seq datasets in the CELLxGENE Census, including reference maps from the Human Cell Atlas. We grouped the cells in each dataset on the basis of the provided metadata and calculated pseudo-bulk transcriptomes by averaging across all scRNA-seq profiles per group. We then applied our LLM-assisted curation procedure to condense the metadata for each group into concise biological descriptions, resulting in 376,983 human transcriptomes with matched textual annotations. Second, we used the combined set of 1,082,413 annotated transcriptomes to train the multimodal CellWhisperer embedding model, which integrates the two data modalities into a joint embedding space (Extended Data Fig. 1a and Fig. 1a). To that end, we adapted the contrastive language image pretraining (CLIP) architecture, processing the transcriptomes with the Geneformer model for gene expression and the textual annotations with the BioBERT model for biomedical text. The two resulting vectors were mapped into a 2,048-dimensional multimodal embedding space using conventional feed-forward neural network layers. We then trained this model to place the two modality-specific embeddings in close proximity within the joint embedding space. We validated that the resulting CellWhisperer embedding model was capable of retrieving the transcriptome corresponding to a given textual annotation and vice versa (a standard metric of CLIP model performance) observing a mean area under the receiver operating characteristic curve (AUROC) value of 0.927 (Extended Data Fig. 1b). The trained CellWhisperer embedding model can be prompted with free-text queries to find matching transcriptomes. 
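The joint-embedding training described above can be illustrated with a CLIP-style symmetric contrastive loss. This is a minimal NumPy sketch, not the authors' training code: the function names, the temperature of 0.07, and the random toy embeddings are assumptions for illustration, and only the 2,048-dimensional embedding size is taken from the text.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(transcriptome_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over a batch of paired embeddings.

    Row i of each matrix is assumed to describe the same sample, so the
    diagonal of the similarity matrix holds the matched pairs.
    """
    t = l2_normalize(transcriptome_emb)
    s = l2_normalize(text_emb)
    logits = t @ s.T / temperature                # (batch, batch) similarities
    idx = np.arange(logits.shape[0])
    loss_t2s = -log_softmax(logits, axis=1)[idx, idx].mean()  # transcriptome -> text
    loss_s2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> transcriptome
    return 0.5 * (loss_t2s + loss_s2t)

# Toy batch: random vectors stand in for projected Geneformer / BioBERT outputs.
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 2048))
aligned = text + 0.01 * rng.normal(size=text.shape)   # well-matched pairs
misaligned = np.roll(aligned, 1, axis=0)              # every pair is mismatched

print("aligned loss:   ", clip_contrastive_loss(aligned, text))
print("misaligned loss:", clip_contrastive_loss(misaligned, text))
```

Minimizing a loss of this kind pulls matched transcriptome and text embeddings together and pushes mismatched pairs apart, which is what makes retrieval in either direction possible.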
The query is processed with the BioBERT-based language model and the resulting embedding is compared to transcriptome embeddings from the Geneformer-based model. The result is a quantitative measure (the 'CellWhisperer score') that assesses the match between the query and each transcriptome in the examined dataset. A high CellWhisperer score indicates that a transcriptome constitutes a good fit for the free-text query. Third, to enable natural-language chats that take the transcriptome information into account, we customized and fine-tuned the Mistral 7B open-weights LLM to incorporate CellWhisperer transcriptome embeddings in addition to text queries. Our approach is inspired by multimodal LLMs that can interpret and converse about images, such as GPT-4, Gemini and LLaVA. We generated a training dataset of 106,610 conversations including simple rule-based question-answer pairs (for example, 'What does the sample represent?', with the sample's textual annotation as the designated answer) and more complex LLM-generated conversations about transcriptomes and cells (technical details are provided in the Methods, examples in Supplementary Note 2). We used the embeddings together with the training-set questions as input to the Mistral 7B LLM (with an adapter layer that converts the embeddings into Mistral-compatible token-level embeddings) and fine-tuned this LLM to produce the matched answers. The resulting fine-tuned LLM responds to free-text questions and engages in natural-language chats about cells and their biological functions, gene-regulatory mechanisms and other biological processes that can be linked to transcriptional cell states. To illustrate CellWhisperer's ability to process, organize and annotate large transcriptome datasets, we clustered the CellWhisperer embeddings for 705,430 GEO-derived human transcriptomes and used the CellWhisperer chat model to textually annotate these clusters (Fig. 
1b; interactive version at https://cellwhisperer.bocklab.org/geo). The CellWhisperer embeddings successfully captured cell types, developmental stages, tissues, diseases and other cell characteristics. For example, when querying the embedding model with the search term 'infection' and projecting the CellWhisperer score (which quantifies the match between the query and each transcriptome) on the UMAP (uniform manifold approximation and projection) visualization of transcriptomes from GEO, it highlights clusters of cells involved in the immune response to infections (Fig. 1c). As each data point in this UMAP connects back to a sample in the GEO database, we can retrieve the corresponding metadata and for example assess the popularity of RNA-seq analysis for certain cell clusters and biological functions over the last decade (Fig. 1d). In summary, we built a multimodal AI that facilitates the seamless transition from transcriptomes to text and vice versa and enables the chat-based analysis of bulk and scRNA-seq data in English language. To assess how well the multimodal CellWhisperer embedding model has learned relevant aspects of human biology, we tested its ability to predict cell characteristics such as cell types, diseases, tissues and organs on the basis of cell transcriptomes in a zero-shot manner (that is, without task-specific fine-tuning or reference data). To that end, we selected expert-annotated transcriptome datasets that were not included in CellWhisperer's training data and we used CellWhisperer to assign scores for each potential cell type label to each transcriptome (Fig. 2a). We then calculated the coherence between the correct cell type labels (as annotated in the dataset) and the computed CellWhisperer scores to quantify CellWhisperer's ability to correctly annotate and identify cells and transcriptomes. We provide detailed evaluation results for this analysis in Supplementary Table 1. 
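The zero-shot evaluation described above can be sketched as follows, assuming that the CellWhisperer score behaves like a cosine similarity between a label's text embedding and each transcriptome embedding, and that per-label AUROC values are averaged one-vs-rest. Both are assumptions made for illustration; the toy embeddings are random, and all function names are hypothetical.

```python
import numpy as np

def cosine_scores(label_emb, transcriptome_emb):
    """Cosine similarity of every transcriptome to every candidate label."""
    q = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    t = transcriptome_emb / np.linalg.norm(transcriptome_emb, axis=1, keepdims=True)
    return t @ q.T                      # shape: (n_transcriptomes, n_labels)

def auroc(scores, is_positive):
    """Rank-based AUROC (Mann-Whitney U); tied scores share their average rank."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):     # average the ranks within each tie group
        tie = scores == value
        ranks[tie] = ranks[tie].mean()
    n_pos = int(is_positive.sum())
    n_neg = len(scores) - n_pos
    u = ranks[is_positive].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def mean_one_vs_rest_auroc(score_matrix, true_labels):
    """Average AUROC over labels; each label needs positives and negatives."""
    n_labels = score_matrix.shape[1]
    return float(np.mean([auroc(score_matrix[:, k], true_labels == k)
                          for k in range(n_labels)]))

# Toy data: three labels, two noisy transcriptome embeddings per label.
rng = np.random.default_rng(0)
label_emb = rng.normal(size=(3, 16))
true_labels = np.array([0, 0, 1, 1, 2, 2])
transcriptomes = label_emb[true_labels] + 0.1 * rng.normal(size=(6, 16))

scores = cosine_scores(label_emb, transcriptomes)
print("mean one-vs-rest AUROC:", mean_one_vs_rest_auroc(scores, true_labels))
```

No reference data or fine-tuning enters this procedure: the only inputs are the label texts and the transcriptomes, which is what "zero-shot" refers to here.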
In the Tabula Sapiens dataset, which comprises scRNA-seq profiles for 483,152 cells from 24 organs, CellWhisperer distinguished 20 common cell types with an AUROC value of 0.94 (Fig. 2b,c). Mix-ups were mainly between closely related cell types, such as 'monocytes' versus 'classical monocytes' and between subgroups of T cells (Fig. 2b). Across all 177 annotated cell types, we obtained an AUROC value of 0.91, but with a lower accuracy value given many highly similar cell types (Fig. 2c). For bulk RNA-seq profiles of immune cells from the ImmGen consortium (GSE227743) and for a recently published scRNA-seq dataset of immune cells from Asian individuals, we obtained AUROC values above 0.99; for a challenging scRNA-seq meta-analysis of human pancreas with closely related cell types and pronounced batch effects, the AUROC value was 0.89 (Fig. 2c). These results support the robustness of our model. Although the CellWhisperer embedding model was never specifically trained to predict cell types (this capability emerged from the more general task of learning connections between transcriptomes and their textual annotations), its zero-shot predictions performed better than a widely used marker-based method and on par with three scFMs that were fine-tuned for cell type prediction (Fig. 2c and Extended Data Fig. 2a). We also assessed our use of Geneformer as the scFM in the CellWhisperer embedding model relative to two alternative scFMs (scGPT and UCE) and we observed comparable performance trends (Extended Data Fig. 2a). To test whether CellWhisperer can also predict other cell characteristics, we assessed its zero-shot prediction performance for sample annotations of diseases, tissues and organs. To that end, we assembled a collection of 14,112 disease-associated transcriptomes from GEO that were excluded from our training data. Predicting 229 disease subtypes represented in this Human Diseases dataset, CellWhisperer achieved an AUROC value of 0.82 (Fig. 
2d), indicating that disease prediction is harder than cell type prediction but possible with a performance that is substantially better than a random baseline. Similarly, CellWhisperer was able to predict the tissue-of-origin of bulk and single-cell transcriptomes with better-than-random prediction performance both in the Tabula Sapiens dataset (AUROC: 0.75) and in the Human Diseases dataset (AUROC: 0.87) (Fig. 2d). To gauge the breadth of biological processes captured by our model, we investigated its recognition of expert-curated gene sets spanning diverse areas of biology. For each of 8,812 gene sets, we used the gene set label (such as 'colorectal cancer') as a query text to CellWhisperer and determined how well each sample in our Human Diseases dataset matched the query. We then calculated the correlation between this purely text-based assessment (which does not use any information about which genes are part of the gene set) and the gene expression enrichment for the genes in the gene set, across all samples in the Human Diseases dataset (Extended Data Fig. 2b). In other words, we tested whether CellWhisperer had implicitly learned an understanding of the genes that matter for established biological concepts, represented here by gene sets and their labels. We found a clear positive association between CellWhisperer scores for these labels and the expression of their corresponding gene sets (Extended Data Fig. 2c,d and Supplementary Table 2), indicating that our model has learned (albeit imperfectly) many of the tested biological concepts. Importantly, CellWhisperer achieved this by training on transcriptomes and their textual annotations, without having seen any expert-curated gene sets during model training. For further evaluation, we tested how well our model can distinguish between biological signal and technical noise in the Tabula Sapiens dataset, based on an established benchmark for dataset integration and batch effect correction. 
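The gene set comparison described above amounts to correlating two per-sample quantities: a text-derived score for the gene set label and an expression-derived enrichment for the genes in the set. The sketch below uses mean z-scored expression as a simple stand-in for the enrichment measure, which is an assumption (the source does not specify the method here); the gene names, expression values, and scores are made up.

```python
import numpy as np

def gene_set_enrichment(expr, gene_names, gene_set):
    """Per-sample mean of z-scored expression over the genes in the set.

    expr is a (samples x genes) matrix. This is a simplified stand-in for
    a proper enrichment statistic.
    """
    mask = np.isin(gene_names, list(gene_set))
    if not mask.any():
        raise ValueError("none of the gene set's genes are in the matrix")
    z = (expr - expr.mean(axis=0)) / (expr.std(axis=0) + 1e-9)
    return z[:, mask].mean(axis=1)

def pearson(x, y):
    """Pearson correlation between two equally long vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# Toy example: five samples, four genes; genes A and B form the gene set.
gene_names = np.array(["A", "B", "C", "D"])
expr = np.array([[1, 1, 5, 0],
                 [2, 2, 4, 1],
                 [3, 3, 3, 0],
                 [4, 4, 2, 1],
                 [5, 5, 1, 0]], dtype=float)

# Hypothetical text-based scores for the gene set label, one per sample.
label_scores = np.array([0.10, 0.20, 0.25, 0.40, 0.50])

enrichment = gene_set_enrichment(expr, gene_names, {"A", "B"})
print("correlation:", pearson(label_scores, enrichment))
```

A positive correlation across samples indicates that the text query tracks the expression of the genes behind the concept, even though the gene list itself was never given to the model.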
We observed improved performance of the CellWhisperer multimodal embeddings compared to transcriptome-only scFMs, for both Geneformer and scGPT, whereas UCE did not profit from the multimodal CellWhisperer training (Extended Data Fig. 2e). The best overall performance was obtained for the standard version of CellWhisperer, which uses Geneformer for the transcriptome embedding. Lastly, we assessed how well the CellWhisperer embedding model handles complex prompts and variations within them, based on a scRNA-seq dataset of human embryonic development (described in detail below). We systematically compared different wordings of the same queries and observed strong concordance between their CellWhisperer scores (Extended Data Fig. 2f). Nevertheless, CLIP-based models are known to be sensitive to prompt variations and we caution that different query wordings may result in different results. In summary, multiple lines of evidence (including zero-shot prediction of cell types, diseases, tissues and organs, a data integration task, gene set prediction from their labels and evaluation of prompt variations) support our conclusion that the CellWhisperer embedding model has learned a meaningful representation of cell states and biological processes, based on training data of transcriptomes and matched textual annotations. To illustrate CellWhisperer's utility in a more complex biological application, we performed a meta-analysis of embryonic development on the basis of scRNA-seq data of human embryos that we curated from the literature. We identified and integrated six separate datasets with 95,092 scRNA-seq profiles of human embryos collected 3-38 days after fertilization. These data, which were not part of our training dataset, were processed and annotated with CellWhisperer (Fig. 3a; https://cellwhisperer.bocklab.org/development). 
To investigate whether CellWhisperer can identify temporal dynamics in embryonic development, we prepared queries corresponding to four key developmental stages using LLM-based aggregation of vertebrate embryology descriptions. The CellWhisperer scores for these queries matched the expected timing for these stages (Fig. 3a). We next used a similar approach to identify phases of organ development, querying CellWhisperer with the names of ten organs (Extended Data Fig. 3a) as illustrated for 'heart' (Fig. 3b). These basic text queries implicitly captured a gradual activation of genes important for organ development, which we validated against the expression of organ-specific marker genes derived from an atlas of fetal gene expression (Extended Data Fig. 3a). CellWhisperer embeddings are biologically interpretable not only through their link to descriptive text but also by examining genes associated with high CellWhisperer scores. We determined CellWhisperer-identified marker genes for each of the ten investigated organs (Supplementary Table 3) and indeed observed strong overlap with previously reported organ marker genes (median odds ratio: 3.3) (Fig. 3c and Extended Data Fig. 3b). For further validation, we investigated how frequently the CellWhisperer-specific marker genes were co-mentioned with the corresponding organ in publications from the PubMed database of biomedical literature. We found that these genes were co-mentioned with the organ much more frequently than a random set of genes and comparably often as the previously reported organ marker genes. Genes that were shared between both analyses had the highest frequency of co-mentioning (Extended Data Fig. 3b). For each organ, the CellWhisperer analysis identified at least ten new marker genes beyond the previously reported organ marker genes (Supplementary Table 3). These genes had strong support from our analysis of co-mentioning in the biomedical literature (Fig. 3d and Extended Data Fig. 3c). 
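The overlap statistic mentioned above (an odds ratio between CellWhisperer-identified and previously reported marker genes) can be computed from a 2x2 contingency table. The following is a sketch under assumptions: the choice of background gene universe and the placeholder gene names are hypothetical, and the paper's exact procedure may differ.

```python
def overlap_odds_ratio(predicted, reference, background):
    """Odds ratio for the overlap of two gene sets within a background universe.

    2x2 table:            in reference   not in reference
      in predicted            both          only_pred
      not in predicted      only_ref         neither
    """
    background = set(background)
    predicted = set(predicted) & background
    reference = set(reference) & background
    both = len(predicted & reference)
    only_pred = len(predicted - reference)
    only_ref = len(reference - predicted)
    neither = len(background) - both - only_pred - only_ref
    if only_pred == 0 or only_ref == 0:
        return float("inf")             # undefined ratio; report as infinite
    return (both * neither) / (only_pred * only_ref)

# Placeholder gene identifiers, not real marker genes.
background = [f"g{i}" for i in range(100)]
predicted = [f"g{i}" for i in range(0, 10)]    # hypothetical model-derived markers
reference = [f"g{i}" for i in range(6, 16)]    # hypothetical reported markers

print("odds ratio:", overlap_odds_ratio(predicted, reference, background))
```

An odds ratio above 1 means the two marker lists share more genes than expected by chance; a summary such as the median across organs can then be taken over the per-organ values.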
In addition, we observed gene set enrichments for biological functions that are characteristic for the corresponding organs (shown for heart in Extended Data Fig. 3d) and a strong spatial expression correspondence with established and widely used organ marker genes, as validated using a 3D atlas of a gastrulating human embryo (Fig. 3e). In summary, we applied CellWhisperer to the common and nontrivial task of marker gene discovery across multiple user-provided scRNA-seq datasets, which was achieved using simple text queries (comprising only the organ name) and yielded results that complement previously reported organ marker genes at comparable precision. To make CellWhisperer broadly accessible for chat-based analysis of transcriptome data, we integrated it with the CELLxGENE Explorer by adding a CellWhisperer-powered chat box (Fig. 4a; https://cellwhisperer.bocklab.org). CELLxGENE Explorer is an interactive web tool for analyzing scRNA-seq profiles through visual inspection, filtering and differential analysis of cells and samples. CellWhisperer complements CELLxGENE Explorer's functionality for visual analysis by providing natural-language data exploration capabilities including (1) free-text search for cells with user-specified properties; (2) automatic textual annotation of cell clusters; and (3) chat-based investigation of interactively selected cells. More generally, CellWhisperer enables the discussion of cells and genes in natural language through a chat box integrated with the visual features of a single-cell browser. We provide a list of usage examples in Supplementary Note 1. Here, we illustrate CellWhisperer's functionality on the Tabula Sapiens dataset of human organs (Fig. 4). In previous work, we described widespread immune gene activity in nonhematopoietic, structural cells of the mouse, prompting us to explore this phenomenon in a large multi-organ human scRNA-seq dataset. 
We, thus, entered 'structural cells with immune functions' into the CellWhisperer chat box and obtained the corresponding CellWhisperer score as a color-coded overlay to the UMAP visualization of the Tabula Sapiens dataset (Fig. 4a,b). Among the cells that scored highly for this query were endothelial and epithelial cells, fibroblasts and pericytes (Fig. 4b), which are all known or suspected to have important immune-regulatory roles. To investigate these cells in more detail, we sequentially selected cell clusters with high CellWhisperer scores (by drawing a circle around the cells of interest) and prompted CellWhisperer by entering 'Describe these cells in detail' into the chat box (Fig. 4a-c). For each cell cluster, we obtained textual descriptions that were generated by the CellWhisperer chat model on the basis of the CellWhisperer transcriptome embeddings averaged across the selected cells (Fig. 4b). The resulting descriptions contained information about cell types, organs and developmental stages and, less frequently, details about potential sample donors (such as male or female), highly expressed genes (such as genes encoding collagens and matrix metalloproteinases in fibroblasts), biological functions (such as stress response) and other annotations. We found that the generated descriptions frequently referred to potential immune functions of the selected cells, consistent with our initial search query. To obtain additional information about these cells, we interactively selected one of the cell clusters and asked two follow-up questions: 'What is the potential relevance of these immune functions?' and 'How can the genes and pathways that are upregulated in these cells mechanistically contribute to these immune functions?'. This resulted in a coherent conversation with CellWhisperer, providing further characterization with highlighted genes and biological functions that are relevant in the selected cells (Fig. 4c). 
As a plausibility check, we confirmed the expression of those genes by projecting them on the UMAP (Fig. 4d). Lastly, we benchmarked the CellWhisperer chat model using the perplexity metric, which is a common evaluation criterion for LLMs. We assessed how well each question-answer pair fits with the matched transcriptome in two test sets of biologically meaningful conversations (Methods). In our Evaluation Conversations dataset with 200 question-answer pairs, we observed a 90% preference for matched over unmatched transcriptomes (Extended Data Fig. 4a), which confirms that our LLM meaningfully interpreted the transcriptome embedding for its response generation. Furthermore, in the Cell Type Conversations dataset, we found that most cell type labels showed a preferential association with their matched transcriptomes (Extended Data Fig. 4b). We further assessed the perplexity for responses obtained with the Mistral 7B LLM (which the CellWhisperer chat model builds upon) and for the much larger Llama 3.3 70B LLM (Extended Data Fig. 4c). CellWhisperer achieved best results (lowest perplexity values), even on the out-of-distribution Cell Type Conversations dataset, further supporting that our chat model effectively incorporates the CellWhisperer transcriptome embeddings. We also assessed whether the CellWhisperer chat model may benefit from explicitly providing a list of highly expressed genes as part of the prompt (as commonly done when analyzing transcriptomes with text-only LLMs), in addition to the transcriptome embedding. We observed a mild beneficial effect (Extended Data Fig. 4c) and implemented this hybrid approach in the CellWhisperer web tool. In summary, the integration of a CellWhisperer chat box in the CELLxGENE Explorer software provides user-friendly access to CellWhisperer's AI features and demonstrates the complementarity of visual inspection and natural-language chats for the interactive exploration of scRNA-seq data. 
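The perplexity-based benchmark above can be illustrated with a short sketch. Perplexity is the exponential of the negative mean token log-probability of an answer; the "preference" is taken here to be the fraction of question-answer pairs whose perplexity is lower when the model is conditioned on the matched rather than an unmatched transcriptome. That reading of the metric, and the toy log-probabilities, are assumptions; obtaining real log-probabilities requires running the chat model, which is outside this sketch.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability of the answer tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def matched_preference(matched_logprobs, unmatched_logprobs):
    """Fraction of answers that are less perplexing with the matched transcriptome."""
    pairs = list(zip(matched_logprobs, unmatched_logprobs))
    wins = sum(perplexity(m) < perplexity(u) for m, u in pairs)
    return wins / len(pairs)

# Hypothetical per-token log-probabilities for three answers.
matched = [
    [math.log(0.6), math.log(0.7), math.log(0.5)],
    [math.log(0.4), math.log(0.5)],
    [math.log(0.2), math.log(0.3)],
]
unmatched = [
    [math.log(0.3), math.log(0.4), math.log(0.2)],
    [math.log(0.2), math.log(0.3)],
    [math.log(0.5), math.log(0.6)],
]

print("preference for matched transcriptomes:", matched_preference(matched, unmatched))
```

Lower perplexity means the model finds the reference answer more likely, so a high preference for matched transcriptomes suggests that the embedding actually informs the generated text.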
To analyze user-provided transcriptome datasets with CellWhisperer, we developed a data-processing pipeline that computes CellWhisperer embeddings and annotations on the basis of the read count matrices from bulk RNA-seq or scRNA-seq (details are provided in the source code repository: https://github.com/epigen/cellwhisperer). The processed data are stored in a single file for dynamic loading into a user-hosted instance of CellWhisperer, while also facilitating reproducibility and sharing of CellWhisperer analyses. Here, we describe a typical CellWhisperer data analysis, investigating stem and progenitor cells in human colon and their response to inflammation (Fig. 5a-f); and we compare it to conventional bioinformatics analysis (Fig. 5g-l). Our analyses are based on scRNA-seq data of pathogenic and adjacent normal biopsies of persons with inflammatory bowel disease and healthy controls. The cluster labels generated by CellWhisperer (Fig. 5a) provide an initial overview of the dataset (Fig. 5b), identifying epithelial cells ('Cycling ileal epithelial precursor cells' and 'Large intestine goblet Cells') as well as immune cells ('Activated CD8 T cells in intestine' and 'Mast cells expressing inflammatory marker genes'). Among the 'Cycling ileal epithelial precursor cells', we searched for cells with stem cell characteristics using the CellWhisperer query 'Show me stem cells' and identified a subset of cells within this cluster that scored highly for this query (Fig. 5c). Further investigation of these putative stem cells in a follow-up conversation with CellWhisperer (Fig. 5d) suggested that this cell cluster includes LGR5-expressing epithelial stem cells, which constitute well-established stem cells of the gut. As expected, LGR5 gene expression (Fig. 5e) was highly correlated with the CellWhisperer score for the 'Show me stem cells' query (Fig. 5c). 
We further compared the prevalence of the CellWhisperer-annotated epithelial stem cells between inflamed and noninflamed colon samples and we observed higher CellWhisperer scores for the 'stem cells' query among the noninflamed samples (Fig. 5f). These results suggest that chronic gut inflammation in persons with inflammatory bowel disease has a negative effect on LGR5-expressing epithelial stem cells, matching the conclusions of the study from which the dataset was obtained and previous in vitro experiments. Importantly, these analyses were performed swiftly and interactively with CellWhisperer. All figure panels (Fig. 5b-f) were taken from the web tool as screenshots (https://cellwhisperer.bocklab.org/colonic_epithelium). For comparison, we sought to reproduce these results with a conventional bioinformatics analysis using custom Python code (Fig. 5g). We downloaded and preprocessed the gene expression profiles from GEO and visualized them as a UMAP (Fig. 5h, left). We observed substantial batch effects (which was less of an issue in the CellWhisperer analysis because the embedding model intrinsically adjusts for batch effects, as illustrated in Fig. 5a and Extended Data Fig. 2e); hence, we corrected for batch effects using the scVI method (Fig. 5h, right). Next, we performed cell type annotation using the CellTypist software tool. With CellTypist's recommended parameters, no cell cluster was annotated as stem cells (Fig. 5i); however, when we reran CellTypist to predict the cell types of individual cells instead of cell clusters, we uncovered a subset of cells annotated as stem cells that were part of the broader cluster of transient-amplifying cells (Fig. 5j). These cells were characterized by high levels of LGR5 expression (Fig. 5k), confirming that these are indeed epithelial stem cells. Lastly, we calculated a general 'stemness score' on the basis of a previously reported gene set and observed lower values in inflamed than in noninflamed colon samples (Fig. 
5l), consistent with the CellWhisperer results. This conventional bioinformatics analysis reproduced the conclusions of the interactive CellWhisperer analysis but it was much more complex and time-consuming. Overall, it took 400 lines of custom Python code, calls to five specialized software tools and the expertise of an experienced bioinformatician to plan and conduct the analysis. In summary, CellWhisperer offers a rapid initial assessment of scRNA-seq datasets and an interactive approach to data exploration and hypothesis generation. In contrast, conventional bioinformatics analysis provides more fine-grained control and better traceability. Given the complementary strengths of these two approaches, we envision that chat-based analysis will guide rather than replace sophisticated code-based analyses.
[2]
Chatting with your cells: Natural-language AI for single-cell data analysis
Using sophisticated RNA sequencing technology, biomedical researchers can measure the activity of our genes across millions of single cells, creating detailed maps of tissues, organs, and diseases. Analyzing these datasets requires a rare combination of skills: a deep understanding of the biology, and the ability to develop computer code that turns data into insights. What if we could equip biomedical researchers with an AI assistant that sees the data, supports the analysis, knows about the biology, and is easy to talk to? This could give scientists a virtual, AI-based colleague with both biological and bioinformatics expertise to support them in their research. Toward this goal, researchers led by Christoph Bock, Principal Investigator at the CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences and Professor at the Medical University of Vienna, have developed CellWhisperer. CellWhisperer is an AI method and software tool that links gene expression with descriptive text across more than a million biological samples. It provides an AI chat box to investigate complex biology in English language, unburdened by the complexities of computer code. This study, published in Nature Biotechnology, demonstrates how AI creates a new way for scientists to interact with their data when studying the biological foundations of diseases. From genes to text -- and vice versa CellWhisperer uses multimodal deep learning on gene activity profiles and matched biological text, which the authors curated from public databases with the help of AI models. Combining these two data modalities, it becomes possible to search massive datasets with text-based queries such as "Show me immune cells from the inflamed colon of patients with autoimmune diseases." The CellWhisperer multimodal AI further integrates a large language model that was trained to emulate discussions between biologists and bioinformaticians when analyzing data. 
Chatting with CellWhisperer thus sounds a bit like talking to a bioinformatics colleague, relying on CellWhisperer's view of the biological data and the biological knowledge of the large language model. For example, users can ask CellWhisperer about genes that are active in cells of interest, and let the model comment on potential biological implications. CellWhisperer is built into a user-friendly web frontend based on the popular CELLxGENE browser and freely accessible online. "By training on experimental data of 20,000 studies from the last two decades, CellWhisperer learned about the biological roles of genes and cells," explains co-first author Moritz Schaefer, a former Postdoctoral Researcher in Bock's research group at CeMM and now at Stanford University. "This way, CellWhisperer is prepared to analyze new single-cell RNA sequencing data from many areas, making biomedical data exploration easier and more exciting." A step toward AI research agents To illustrate CellWhisperer's potential for biological discovery, the team applied it to single-cell RNA sequencing data of human embryonic development. With basic queries such as "heart" or "brain," the model identified developmental time points, cell populations, and marker genes associated with human organ formation. Many of these markers matched known developmental genes, while others pointed to previously overlooked candidates. "CellWhisperer is not just making biomedical research easier, it helps me understand what is going on in the cells that I am studying," says Peter Peneder, co-first author at the St. Anna Children's Cancer Research Institute. "Science is teamwork, and with CellWhisperer, an AI research assistant has joined our team. CellWhisperer really helps with exploratory research -- getting a first impression of a new dataset and figuring out where to dig deeper. It supports and empowers us as human scientists," emphasizes Bock.
[3]
AI chat box helps investigate complex biology in English language
CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Nov 11 2025. Using sophisticated RNA sequencing technology, biomedical researchers can measure the activity of our genes across millions of single cells, creating detailed maps of tissues, organs, and diseases. Analyzing these datasets requires a rare combination of skills: deep understanding of the biology, and the ability to develop computer code that turns data into insights. What if we could equip biomedical researchers with an AI assistant that sees the data, supports the analysis, knows about the biology, and is easy to talk to? This could give scientists a virtual, AI-based colleague with both biological and bioinformatics expertise to support them in their research. Toward this goal, researchers led by Christoph Bock, Principal Investigator at the CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences and Professor at the Medical University of Vienna, have developed CellWhisperer. CellWhisperer is an AI method and software tool that links gene expression with descriptive text across more than a million biological samples. It provides an AI chat box to investigate complex biology in English language, unburdened by the complexities of computer code. This study, published in Nature Biotechnology, demonstrates how AI creates a new way for scientists to interact with their data when studying the biological foundations of diseases. From genes to text - and vice versa CellWhisperer uses multimodal deep learning on gene activity profiles and matched biological text, which the authors curated from public databases with the help of AI models. Combining these two data modalities, it becomes possible to search massive datasets with text-based queries such as "Show me immune cells from the inflamed colon of patients with autoimmune diseases". 
The CellWhisperer multimodal AI further integrates a large language model that was trained to emulate discussions between biologists and bioinformaticians when analysing data. Chatting with CellWhisperer thus sounds a bit like talking to a bioinformatics colleague, relying on CellWhisperer's view of the biological data and the biological knowledge of the large language model. For example, users can ask CellWhisperer about genes that are active in cells of interest, and let the model comment on potential biological implications. CellWhisperer is built into a user-friendly web frontend based on the popular CELLxGENE browser and freely accessible online: https://cellwhisperer.bocklab.org. "By training on experimental data of 20,000 studies from the last two decades, CellWhisperer learned about the biological roles of genes and cells," explains co-first author Moritz Schaefer, a former Postdoctoral Researcher in Christoph Bock's research group at CeMM and now at Stanford University. "This way, CellWhisperer is prepared to analyse new single-cell RNA sequencing data from many areas, making biomedical data exploration easier and more exciting." A step toward AI research agents To illustrate CellWhisperer's potential for biological discovery, the team applied it to single-cell RNA sequencing data of human embryonic development. With basic queries such as "heart" or "brain", the model identified developmental time points, cell populations, and marker genes associated with human organ formation. Many of these markers matched known developmental genes, while others pointed to previously overlooked candidates. "CellWhisperer is not just making biomedical research easier, it helps me understand what is going on in the cells that I am studying," says Peter Peneder, co-first author, St. Anna Children's Cancer Research Institute. "Science is teamwork, and with CellWhisperer, an AI research assistant has joined our team. 
CellWhisperer really helps with exploratory research - getting a first impression of a new dataset and figuring out where to dig deeper. It supports and empowers us as human scientists," emphasizes Christoph Bock. CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences Journal reference: Schaefer, M., et al. (2025). Multimodal learning enables chat-based exploration of single-cell data. Nature Biotechnology. doi: 10.1038/s41587-025-02857-9. https://www.nature.com/articles/s41587-025-02857-9
Researchers at CeMM have developed CellWhisperer, a groundbreaking multimodal AI that enables scientists to explore complex single-cell RNA sequencing data through natural language conversations, potentially revolutionizing biomedical research accessibility.
Researchers at the CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences have developed CellWhisperer, an artificial intelligence system that enables biomedical scientists to explore complex single-cell RNA sequencing data through natural language conversations [1]. Published in Nature Biotechnology, this multimodal AI makes sophisticated biological data analysis accessible to researchers without extensive programming expertise [2].

Led by Christoph Bock, Principal Investigator at CeMM and Professor at the Medical University of Vienna, the research team created an AI assistant that combines biological knowledge with bioinformatics capabilities, effectively serving as a virtual colleague for scientists studying disease mechanisms and cellular behavior [3].

CellWhisperer was developed in three steps: curation of multimodal training data, training of an embedding model, and development of a chat model. The researchers first created a training dataset comprising 1,082,413 pairs of human RNA sequencing profiles matched with textual annotations [1]. This dataset was assembled using LLM-assisted curation applied to data from GEO and CELLxGENE Census; GEO alone comprises human RNA-seq data from more than 20,000 individual studies.

The team used the ARCHS4 uniform reprocessing of GEO data and developed an LLM-assisted curation procedure to create concise, biologically informative textual annotations for each sample. These annotations were derived from the sample-specific metadata provided by GEO, which includes cell types, organs, tissues, diseases, experimental methods, and scientific project abstracts [1]. Additionally, they derived pseudo-bulk transcriptomes from hundreds of single-cell RNA sequencing datasets, grouping cells based on metadata and calculating averaged transcriptomes per group.

The CellWhisperer embedding model adapts the contrastive language-image pretraining (CLIP) architecture, processing transcriptomes with the Geneformer model and textual annotations with the BioBERT model [1]. The system maps these inputs into a 2,048-dimensional multimodal embedding space using feed-forward neural networks, and the model is trained to place corresponding transcriptomes and text descriptions close together in the joint embedding space.

In validation, the model achieved a mean area under the receiver operating characteristic curve (AUROC) of 0.927 [1]. Researchers can therefore use free-text queries to find matching transcriptomes, with the system providing quantitative CellWhisperer scores that assess how well each transcriptome in the examined dataset matches the query.
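The contrastive training idea and the text-to-transcriptome scoring described above can be illustrated with a minimal, standard-library Python sketch. It assumes the standard CLIP-style symmetric cross-entropy objective over cosine similarities and ranks transcriptomes by cosine similarity to a query embedding. The toy 3-dimensional vectors, the temperature of 0.07, and all function names are illustrative assumptions and are not taken from the CellWhisperer code; the real model uses learned encoders and a 2,048-dimensional space.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_matrix(a_rows, b_rows):
    """Pairwise cosine similarity between every row of a_rows and every row of b_rows."""
    a = [l2_normalize(r) for r in a_rows]
    b = [l2_normalize(r) for r in b_rows]
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(transcriptome_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: pair i in one modality should match pair i in the other.

    The temperature value is an assumed toy setting, not a reported hyperparameter.
    """
    sims = cosine_matrix(transcriptome_emb, text_emb)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-transcriptome direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def rank_by_query(query_emb, transcriptome_emb):
    """Return (index, score) pairs sorted from best to worst cosine match."""
    scores = cosine_matrix([query_emb], transcriptome_emb)[0]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Toy embeddings (hypothetical): row i of each list describes the same sample.
transcriptomes = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
matched_text = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 1.1]]
shuffled_text = [matched_text[1], matched_text[2], matched_text[0]]

aligned_loss = clip_style_loss(transcriptomes, matched_text)
misaligned_loss = clip_style_loss(transcriptomes, shuffled_text)
ranking = rank_by_query([0.0, 1.0, 0.0], transcriptomes)
```

In this toy setup the loss is lower when matching pairs share an index than when the text rows are shuffled, which is the signal that pulls corresponding transcriptomes and descriptions together during training. The same similarity function then serves as a query score at search time.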
To enable natural language conversations, the researchers customized and fine-tuned the Mistral 7B open-weights large language model to incorporate CellWhisperer transcriptome embeddings alongside text queries [1]. This approach draws inspiration from multimodal LLMs like GPT-4, Gemini, and LLaVA, creating a system that can interpret and discuss biological data conversationally.
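One common way that LLaVA-style multimodal models pass a non-text input to a language model is to project its embedding into the model's token-embedding space and place the result in front of the embedded text tokens. The sketch below shows that general pattern in plain Python with toy dimensions. The projection weights, the dimensions, and the function names are all hypothetical, and the sketch is not a description of how CellWhisperer itself wires embeddings into Mistral 7B.

```python
def project(embedding, weight, bias):
    """Affine projection: one output value per row of `weight`."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weight, bias)
    ]

def build_llm_inputs(transcriptome_emb, text_token_embs, weight, bias):
    """Prepend the projected transcriptome embedding as one extra 'token'.

    Assumption: a single prefix token is used; real systems may use several.
    """
    prefix_token = project(transcriptome_emb, weight, bias)
    return [prefix_token] + text_token_embs

# Toy setup (hypothetical): 4-dim transcriptome embedding, 3-dim LLM hidden size.
weight = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
bias = [0.0, 0.0, 0.5]
transcriptome_emb = [0.2, 0.4, 0.1, 0.3]
text_token_embs = [[0.1, 0.1, 0.1], [0.2, 0.0, 0.3]]  # two embedded text tokens

llm_inputs = build_llm_inputs(transcriptome_emb, text_token_embs, weight, bias)
```

The resulting sequence has one more position than the text alone, and every position has the language model's hidden size, so the model can attend to the transcriptome "token" like any other input.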
The training process included generating 106,610 conversations encompassing simple rule-based question-answer pairs and complex AI-generated discussions about transcriptomes and cells [1]. This enables researchers to make queries such as "Show me immune cells from the inflamed colon of patients with autoimmune diseases" and receive relevant biological insights [2].

CellWhisperer is integrated into a user-friendly web frontend based on the popular CELLxGENE browser and is freely accessible online, making it available to the global research community [3].

Summarized by Navi