Advancing evidence-based medicine requires integrating clinical expertise with data analysis. While clinicians contribute essential domain knowledge, applying modern data science methods often requires specialized training, creating a barrier to adoption. To bridge this gap, we developed ChatDA, an artificial intelligence agent enabling large language model-mediated conversational analysis of de-identified clinical tabular datasets. ChatDA empowers clinicians to extract meaningful insights efficiently and accurately, making data-driven clinical research more accessible and effective.
Data analysis is fundamental to evidence-based medicine, supporting both clinical decision-making and the interpretation of research findings. However, many clinicians face technical challenges when analyzing data, as their training often emphasizes clinical practice over statistics and data science. These challenges can lead to inefficiencies or reliance on external data analysts, preventing clinical experts from fully engaging with their data and potentially slowing the translation of data-driven insights into actionable interventions. Recent advances in artificial intelligence (AI), particularly large language models (LLMs), offer a promising path toward lowering these barriers by expanding access to data science capabilities among clinicians.
LLMs have demonstrated remarkable potential across various healthcare applications, from bioinformatics research assistance to AI-assisted clinical decision-making. These models excel in domain knowledge recall, information retrieval, logical reasoning, and code generation. Particularly relevant to clinical research assistance, AI-powered code-writing agents have proven highly effective for data analysis tasks from statistical testing to machine learning. While such agents hold promise for AI-assisted clinical data analysis, their effectiveness hinges on the capabilities of powerful LLMs, as smaller models struggle with accurately generating complex code scripts. These high-performance LLMs rely on cloud-based inference, raising data privacy concerns: when code-writing agents process clinical data, they may inadvertently expose sensitive patient information to external providers. Even if the data is de-identified, the sharing of detailed patient-level information increases the risk of re-identification, posing significant ethical and regulatory challenges. This underscores the tension between leveraging AI in clinical research and maintaining data protection standards.
To address these challenges, we developed Chat Data Analyst (ChatDA), an AI agent designed for analyzing de-identified tabular clinical data that operates through specialized tool use rather than code generation. ChatDA reduces the data privacy risks associated with conventional code-writing agents by restricting the language model's access to data: analyses are conducted through custom tools that return only population-level insights, ensuring individual-level data remains inaccessible to the underlying cloud-hosted LLM. Encouraged by recent studies demonstrating the effectiveness of agentic tool use in practical applications, we hypothesized that ChatDA, similarly equipped with a specialized toolkit, could match or outperform state-of-the-art code-execution agents. Moreover, this approach may offer additional advantages: by guiding analyses through well-defined, pre-validated operations instead of free-form code execution, it enhances result consistency, reduces the risk of errors, and improves interpretability.
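The privacy-preserving tool-use pattern described above can be sketched as follows. This is an illustrative sketch, not ChatDA's actual implementation: the dataset, column names, and tool functions are hypothetical, and a real agent would route these calls through an LLM's function-calling interface. The key property is that each tool returns only population-level aggregates, so individual-level records never reach the cloud-hosted model.

```python
import pandas as pd

# Hypothetical de-identified clinical dataset; columns are illustrative only.
df = pd.DataFrame({
    "age": [67, 72, 58, 81, 64, 75],
    "procedure": ["TKA", "UKA", "TKA", "TKA", "UKA", "TKA"],
})

def summarize_numeric(column: str) -> dict:
    """Tool exposed to the LLM: returns population-level statistics only.

    The raw column values never leave this function, so the model sees
    aggregates rather than individual-level records.
    """
    series = df[column]
    return {
        "n": int(series.count()),
        "mean": round(float(series.mean()), 2),
        "std": round(float(series.std()), 2),
        "min": float(series.min()),
        "max": float(series.max()),
    }

def count_by_group(column: str) -> dict:
    """Tool exposed to the LLM: returns category counts, not rows."""
    return df[column].value_counts().to_dict()

print(summarize_numeric("age"))
print(count_by_group("procedure"))
```

Because the LLM only ever receives the dictionaries these tools return, even a fully cloud-hosted model cannot reconstruct a patient-level row from the exchange.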
In this study, we evaluate our hypothesis by benchmarking ChatDA across a range of standard clinical data analysis tasks, including summary statistics generation, statistical testing, regression analysis, and machine learning modeling. We compare its performance to four existing agents on 21 publicly available datasets: (1) OpenAI Advanced Data Analysis (ADA), a ChatGPT feature that enables OpenAI models to execute code to analyze user-uploaded data; (2) Open Interpreter, an open-source coding interface for LLMs; (3) Data Interpreter, an open-source agent that tackles complex data analysis tasks primarily through code generation; and (4) LAMBDA, a peer-reviewed coding agent designed for data analysis. Then, to demonstrate the practical utility of ChatDA, we present a case study in which ChatDA analyzes a proprietary, de-identified knee arthroplasty dataset to extract meaningful population-level insights.
ChatDA, illustrated in Fig. 1A, is an LLM-powered agent that operates a custom data science toolkit. ChatDA's toolkit -- an extension of TableMage, a novel software package for low-code clinical data science -- provides an integrated data and analysis environment, supporting data transformation, statistical testing, figure generation, regression analysis, and machine learning modeling (Methods). ChatDA can optionally operate a Python interpreter, which can be enabled to boost performance when ChatDA is powered by a locally hosted LLM. When powered by a multimodal LLM such as GPT-4o, ChatDA can analyze the figures it generates, further enhancing its ability to interpret results.
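One way a toolkit like this can be wired into an LLM agent is through a tool registry: each analysis operation is registered with a name and a description that is serialized into the model's prompt, and the model selects tools by name. The sketch below is hypothetical -- the tool names mirror the toolkit categories described in the text but are not TableMage's actual API, and the tool bodies are stubbed out.

```python
import json
from typing import Callable

# Registry mapping tool names to descriptions and callables. An agent
# framework would dispatch the LLM's tool calls through this table.
TOOLS: dict[str, dict] = {}

def register_tool(name: str, description: str):
    """Decorator that adds a function to the tool registry."""
    def wrapper(fn: Callable):
        TOOLS[name] = {"description": description, "fn": fn}
        return fn
    return wrapper

@register_tool("ttest", "Two-sample t-test between groups; returns statistic and p-value.")
def ttest(column: str, group_column: str) -> dict:
    ...  # stub: a real tool would run the test on the managed dataset
    return {"statistic": None, "p_value": None}

@register_tool("fit_logistic", "Fit logistic regression; returns coefficients and test AUC.")
def fit_logistic(target: str, predictors: list) -> dict:
    ...  # stub: a real tool would fit and evaluate on a held-out split
    return {"coefficients": {}, "test_auc": None}

def tool_manifest() -> str:
    """Serialize tool names and descriptions for the LLM's system prompt."""
    return json.dumps(
        {name: spec["description"] for name, spec in TOOLS.items()}, indent=2
    )

print(tool_manifest())
```

Because every operation is a named, pre-validated function rather than free-form code, the set of things the agent can do is fully enumerable, which is what makes the tools-only behavior auditable.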
To evaluate ChatDA's accuracy on common data analysis tasks, we curated a novel benchmark for assessing AI agents in data workflows ("Methods", Fig. 1B). We compared ChatDA against OpenAI ADA, Open Interpreter, Data Interpreter, and LAMBDA using this benchmark. ChatDA achieved the highest accuracy and stability across all question topics, with a 10% to 18% improvement in accuracy over the next-best agent. All reported accuracy metrics are accompanied by 95% confidence intervals to quantify performance variability, and standard errors were computed across repeated trials to account for the response fluctuation inherent in stochastic LLM behavior. Specifically, ChatDA achieved an overall accuracy of 0.951 ± 0.008 (95% CI), compared to 0.830 ± 0.010 for OpenAI ADA, 0.803 ± 0.029 for Open Interpreter, 0.701 ± 0.011 for Data Interpreter, and 0.599 ± 0.035 for LAMBDA (Fig. 1C, Supplementary Fig. 1). Notably, ChatDA exhibited greater output stability than the other agents, as reflected in its narrower confidence intervals across evaluation runs. This robustness addresses a well-documented challenge in LLM-powered agents -- namely, the variability and unpredictability of generated outputs under identical prompts. By relying on structured tool-based workflows rather than unconstrained code generation, ChatDA reduces variance in performance and delivers more consistent, interpretable results. ChatDA also outperformed the other code-writing agents at retaining data transformations, such as scaled or engineered features, between analysis steps (ChatDA: 0.887 ± 0.024; OpenAI ADA: 0.692 ± 0.030; Open Interpreter: 0.696 ± 0.050; Data Interpreter: 0.296 ± 0.019; LAMBDA: 0.479 ± 0.069; see Supplementary Fig. 1), likely owing to its data-integrated toolkit, which automatically preserves transformations for subsequent tasks.
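One common way to compute a mean accuracy with a 95% confidence interval from repeated trials is the normal approximation over per-run scores, sketched below. This is an illustrative procedure with hypothetical per-run accuracies; the study's exact interval computation may differ.

```python
import math

def mean_with_ci(accuracies: list, z: float = 1.96) -> tuple:
    """Mean accuracy and 95% half-width via the normal approximation.

    The standard error is computed across repeated runs, capturing the
    run-to-run variability of a stochastic LLM agent under identical prompts.
    """
    n = len(accuracies)
    mean = sum(accuracies) / n
    variance = sum((a - mean) ** 2 for a in accuracies) / (n - 1)  # sample variance
    se = math.sqrt(variance / n)  # standard error of the mean
    return mean, z * se

# Hypothetical per-run accuracies for one agent over five repeated runs.
runs = [0.94, 0.96, 0.95, 0.955, 0.95]
mean, half_width = mean_with_ci(runs)
print(f"{mean:.3f} ± {half_width:.3f}")
```

A narrower half-width under this computation corresponds directly to the output stability described above: the less a stochastic agent's answers fluctuate across runs, the tighter its interval.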
Next, we evaluated ChatDA's capability to train machine learning models across a set of publicly curated regression and classification tasks for tabular datasets (Methods). ChatDA achieved competitive performance on the machine learning benchmark, outperforming OpenAI ADA in 10 out of 11 tasks (i.e., achieving a lower average test RMSE for regression or a higher average test AUC for classification), Open Interpreter in 6 out of 11 tasks, Data Interpreter in 9 out of 11 tasks, and LAMBDA in 7 out of 11 tasks. ChatDA's performance on both the data analysis benchmark and the machine learning benchmark demonstrates its potential to match or surpass existing state-of-the-art conversational data analysis solutions at machine learning modeling. Complete machine learning benchmark results are presented in Supplementary Figs. 2-3.
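The pairwise comparison rule used above can be made concrete: for regression tasks the agent with the lower average test RMSE wins, and for classification tasks the agent with the higher average test AUC wins. A minimal sketch with hypothetical per-task scores:

```python
# Hypothetical per-task scores for two agents. "metric" direction differs:
# RMSE (regression) is better when lower; AUC (classification) when higher.
tasks = [
    {"type": "regression",     "chatda": 3.10, "baseline": 3.40},
    {"type": "classification", "chatda": 0.88, "baseline": 0.85},
    {"type": "classification", "chatda": 0.79, "baseline": 0.83},
]

def chatda_wins(task: dict) -> bool:
    if task["type"] == "regression":
        return task["chatda"] < task["baseline"]  # lower RMSE is better
    return task["chatda"] > task["baseline"]      # higher AUC is better

wins = sum(chatda_wins(t) for t in tasks)
print(f"ChatDA outperforms the baseline in {wins} of {len(tasks)} tasks")
```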
To demonstrate ChatDA in practice, we conducted a conversational case study using real-world de-identified clinical data (Methods). Through the ChatDA user interface (Fig. 1D), we analyzed an in-house proprietary dataset of 1419 knee arthroplasty patients to examine whether quantitative anatomical and functional features could predict the surgeon-selected procedure -- Total Knee Arthroplasty (TKA) or Unicompartmental Knee Arthroplasty (UKA). ChatDA first summarized the dataset, removed cases with missing knee arthroplasty annotations, and performed imputation using median values for numeric features and a "missing" category for categorical ones. Among 35 potential predictors, ChatDA applied the Boruta method and identified 6 top features: joint line convergence angle, lateral tibial width, medial proximal tibial angle, tibiofemoral angle, tibial width, and tibial tubercle-trochlear groove distance. A logistic regression model trained on these features achieved a test area under the receiver operating characteristic curve (AUC) of 0.632, suggesting moderate predictive power. Joint line convergence angle, tibiofemoral angle, and tibial tubercle-trochlear groove distance emerged as statistically significant predictors. Notably, ChatDA found that for every unit increase in joint line convergence angle, the odds of UKA increased by 14.3%. Although random forest and XGBoost models were also evaluated, they did not outperform logistic regression but reaffirmed the importance of tibial tubercle-trochlear groove distance and tibiofemoral angle. All results were generated in a single uninterrupted conversational session without manual correction, highlighting ChatDA's usability and robustness in a clinical research context (Fig. 2B-D). The complete conversation transcript is available in Supplementary Note 6.
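The odds-ratio interpretation above follows directly from the logistic regression coefficient: a one-unit increase in a predictor multiplies the odds of the outcome by exp(beta). The sketch below uses a coefficient back-calculated to match the reported 14.3% for illustration; it is not the value from the study's fitted model.

```python
import math

def odds_ratio(beta: float) -> float:
    """Odds ratio for a one-unit increase in a logistic regression predictor."""
    return math.exp(beta)

# Hypothetical coefficient for joint line convergence angle, chosen so that
# exp(beta) ≈ 1.143, i.e. a 14.3% increase in the odds of UKA per unit.
beta_jlca = 0.1337
pct_increase = (odds_ratio(beta_jlca) - 1) * 100
print(f"Odds of UKA increase per unit of JLCA: {pct_increase:.1f}%")
```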
ChatDA addresses a critical challenge in AI-assisted analysis of de-identified clinical data: enhancing patient data privacy. Traditional de-identification techniques are often insufficient when individual-level data is passed to cloud-hosted LLMs: even after explicit identifiers are removed, individual-level data points can still create re-identification risk. ChatDA reduces this risk through two mechanisms: a tools-only mode that returns only population-level results to the underlying LLM, and support for locally hosted LLMs.
This work has several limitations that warrant discussion. First, while ChatDA's "tools-only" mode reduces re-identification risk by preventing cloud LLMs from accessing individual-level records, this approach does not eliminate all privacy concerns; modern privacy research has shown that even aggregate statistics can be vulnerable to re-identification. Therefore, ChatDA is intended for use only with datasets that have been properly de-identified in accordance with applicable legal standards (see Supplementary Note 1 for additional discussion and practical guidance). Second, the scope of ChatDA is currently limited to tabular datasets, a focus that does not address the challenges of analyzing other common clinical data types such as medical imaging or unstructured text; moreover, without its Python interpreter, ChatDA's functionality is strictly limited by the availability of pre-defined tools (Supplementary Note 2). Third, our accuracy evaluation was performed on a custom-designed DataAnalysisQA benchmark, which may introduce inherent biases; comprehensively benchmarking data analysis agents remains an open challenge (Supplementary Note 3), and additional analyses, such as studying error propagation, could further strengthen agent evaluation. Fourth, the practical utility of ChatDA is demonstrated through only a single case study, and a formal user study with multiple clinicians is needed to fully validate its real-world effectiveness and usability. Finally, this study compares ChatDA only with code-writing agents and human-generated Python code; as new domain-specific low-code and no-code tools emerge for clinical data analysis, future evaluations should also include those systems.
By reducing barriers to data analysis and enabling secure AI-mediated workflows, ChatDA empowers healthcare professionals to extract meaningful insights from de-identified tabular data and advance evidence-based medicine. Data analysis agents like ChatDA have the potential to transform quantitative clinical research by accelerating discovery, broadening access to data-driven insights, and strengthening the foundation for clinical decision-making in an increasingly data-rich environment. ChatDA's consistent performance across both benchmark evaluations and a real-world application further supports its use as a reliable tool for data analysis.