Figure 1 illustrates the TrialBench platform, containing 8 well-defined clinical trial design tasks. The TrialBench platform provides 23 corresponding AI-ready datasets across these 8 tasks, implemented evaluation metrics, and baseline models. AI experts can easily access the datasets and targets to develop advanced models, evaluate models on specific metrics, and compare them against baseline models for reference.
AI-solvable Clinical Trial Task Definitions
In this paper, we identify 8 AI-solvable clinical trial tasks. For each task, we elaborate on its background, explain how it would help clinical trial design and management, curate the dataset, evaluate the performance of well-known AI methods, and report the empirical results. Table 1 summarizes and compares all the AI-solvable clinical trial tasks and corresponding datasets. We provide the following three aspects for each learning task: (1) Background. Background of the learning task. (2) Definition. A formal definition of the learning task (input feature and output). (3) Broad impact. The broader impact of advancing real clinical trials on the task.
Trial Duration Prediction
Background. The duration of a clinical trial is defined as the number of years from the trial's start date to its completion date, representing a continuous numerical value. The clinical trial duration is directly related to its cost because longer trials require more extended use of resources, including personnel, facilities, and materials, leading to increased expenses.
Definition. This task focuses on predicting trial duration (time span from the enrollment of the first participant to the conclusion of the study) based on multi-modal trial features such as eligibility criteria, target disease, etc. It is formulated as a regression task.
Broad impact. Predicting the duration of clinical trials offers several benefits for the efficiency and effectiveness of drug development. AI-driven predictions allow for better planning and resource allocation, leading to more accurate staffing, budgeting, and management of clinical sites. They also improve decision-making by enabling stakeholders to prioritize projects based on expected timelines and to identify risks early, so that proactive measures can be taken to mitigate delays. Ultimately, accurate duration predictions assist pharmaceutical companies in estimating costs, determining the right number of sites for potential acceleration, and planning effective market launches.
Patient Dropout Prediction
Background. Prior studies have shown that approximately 30% of participants eventually drop out of clinical trials, potentially undermining the validity of trial outcomes and contributing to higher costs and prolonged timelines.
Definition. This task seeks to predict both the occurrence (binary classification) and rate (regression) of patient dropout in clinical trials, based on multi-modal features such as eligibility criteria, target disease, and other protocol-level information. It is formulated as a dual-objective learning problem comprising a classification subtask for dropout occurrence and a regression subtask for dropout rate estimation.
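As a minimal sketch of how the two targets in this definition could be derived from reported counts, the snippet below computes a binary occurrence label and a continuous rate. The function and field names are hypothetical, and treating "occurrence" as "at least one dropout" is an assumption made for illustration; the text does not state the threshold.

```python
from dataclasses import dataclass


@dataclass
class DropoutTargets:
    occurred: int  # classification target: 1 if any participant dropped out (assumed rule)
    rate: float    # regression target: fraction of enrolled participants who dropped out


def dropout_targets(n_dropout: int, n_enrolled: int) -> DropoutTargets:
    """Derive both learning targets from the reported counts of one trial."""
    if n_enrolled <= 0:
        raise ValueError("n_enrolled must be positive")
    if not 0 <= n_dropout <= n_enrolled:
        raise ValueError("n_dropout must lie between 0 and n_enrolled")
    rate = n_dropout / n_enrolled
    return DropoutTargets(occurred=int(n_dropout > 0), rate=rate)


example = dropout_targets(n_dropout=30, n_enrolled=100)
no_dropout = dropout_targets(n_dropout=0, n_enrolled=50)
```

A model for this task would then be trained with a classification loss on `occurred` and a regression loss on `rate`, either jointly or as two separate models.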
Broad impact. Predicting patient dropout in clinical trials holds significant promise for improving the efficiency and effectiveness of drug development. High dropout rates often necessitate the recruitment of additional participants to meet the required sample size, which can be both time-consuming and costly.
Adverse Event Prediction
Background. Adverse event prediction is crucial in clinical trials as it directly impacts the safety, efficacy, and overall success of the trial. The primary concern in any clinical trial is the safety of the participants.
Definition. The task targets predicting the occurrence of adverse events given multi-modal clinical trial features such as drug molecule, target disease, eligibility criteria, etc. It is formulated as a binary classification problem.
Broad impact. Predicting adverse events helps identify potential risks to patients before they occur, allowing proactive measures to be taken. In addition, regulatory organizations such as the FDA and EMA have strict guidelines for monitoring and reporting adverse events in clinical trials. Accurate prediction and early detection of adverse events can help ensure compliance with these regulations.
Mortality Event Prediction
Background. A mortality event in a clinical trial refers to the death of a participant during the study period. When serious adverse events escalate beyond a critical threshold, unsafe treatments or severe disease conditions may lead to fatalities. An unexpected mortality event can trigger ethical concerns and necessitate a thorough safety reassessment. As such, the occurrence of mortality events serves as a key indicator for evaluating the safety and potential risks associated with the treatment or intervention under investigation.
Definition. This task aims to predict the occurrence of mortality in a clinical trial based on multi-modal features, including drug molecules, target diseases, eligibility criteria, and others. It is formulated as a binary classification problem.
Broad impact. Accurate prediction of trial-related mortality enhances patient safety by enabling early identification of high-risk scenarios and timely intervention. It also informs more efficient trial designs, optimizing resource allocation and reducing overall costs. By accelerating drug development and improving regulatory compliance, such predictions contribute to faster delivery of effective treatments and reinforce public trust and ethical integrity in clinical research.
Trial Approval Prediction
Background. Clinical trial approval refers to whether a drug can pass a certain phase of a clinical trial, which is the most important outcome of the trial. Recent investigations suggest that clinical trials suffer from low approval rates.
Definition. This task aims to predict the probability of trial approval given multi-modal trial features such as drug molecule, disease code, and eligibility criteria. It is formulated as a binary classification problem.
Broad impact. Predicting trial approval can enhance the efficiency and success rates of drug development. By accurately forecasting which drugs are likely to pass clinical trial phases, companies can focus their resources on the most promising candidates, reducing wasted time and money on less viable options. This targeted approach can accelerate the development of effective treatments, bringing them to market faster and improving patient outcomes. Additionally, reliable approval predictions can streamline regulatory processes and increase investor confidence in the pharmaceutical industry.
Trial Failure Reason Identification
Background. Clinical trials usually fail for one of several reasons. (1) Business decision (e.g., lack of funding, company strategy shift, pipeline reorganization, drug strategy shift). Business decisions are challenging to predict, so we do not include these trials in our dataset. (2) Poor enrollment. Insufficient enrollment can compromise the statistical power of the study, making it difficult to detect a significant effect of the drug. Poor enrollment can also delay the trial timeline and increase costs, as more resources are required to recruit additional participants. (3) Safety. Unexpected adverse reactions or side effects can occur, posing significant risks to participants' health. This can lead to the trial being halted or terminated. (4) Efficacy (effectiveness). In a trial, the tested drug is expected to outperform the standard treatment in treating the target disease. Demonstrating efficacy is therefore typically required for a trial to succeed.
Definition. Given clinical trial features, the goal of this task is to use an AI model to classify each trial into one of four categories: (1) successful trial; (2) failure due to poor enrollment; (3) failure due to a drug safety issue; (4) failure due to lack of efficacy. It is a multi-class (four-category) classification problem.
Broad impact. Accurately predicting the reasons for clinical trial failures can greatly enhance the efficiency of drug development by preventing costly delays and optimizing resource allocation. This leads to faster delivery of effective treatments to patients, improving patient outcomes and public health. Additionally, better-designed trials with higher success rates can encourage greater confidence and participation in clinical research.
Eligibility Criteria Design
Background. To achieve statistically significant results, a clinical trial must meet its target sample size. Insufficient patient numbers can lead to underpowered studies, which may fail to demonstrate the effectiveness of a treatment or may miss important safety information. Eligibility criteria are essential to patient recruitment. They describe the patient recruitment requirements in unstructured natural language. Eligibility criteria comprise multiple inclusion and exclusion criteria, which specify what is desired and undesired when recruiting patients. Each individual criterion is usually a natural language sentence.
Definition. This task aims to design eligibility criteria given a series of clinical trial features such as target disease, phase, drug molecules, etc.
Broad impact. Using AI models to design eligibility criteria for clinical trials offers several significant advantages. AI can predict which patients are more likely to meet the eligibility criteria based on historical data and real-world evidence. This speeds up the recruitment process by identifying suitable candidates faster and reducing the time and cost associated with screening large numbers of unsuitable participants.
Drug Dose Finding
Background. One of the primary goals of clinical trials is to determine the drug dose. Determining the correct dosage of a drug is crucial to ensure its effectiveness in treating a particular condition. In the early stages of drug development, predicting the optimal dosage is essential for designing clinical trials.
Definition. This task aims to predict drug dosage based on drug molecular structure and target disease, which is formulated as an ordinal classification problem.
Broad impact. By estimating the dose-response relationship and identifying the dosage range that balances efficacy and safety, researchers can design more informative and efficient clinical studies.
Raw Data
Our primary data source is the ClinicalTrials.gov website (https://clinicaltrials.gov/), which serves as a publicly accessible resource for clinical trial information. Supported by the U.S. National Library of Medicine, this database encompasses over 420,000 clinical trial records, spanning all 50 U.S. states and 221 countries worldwide. The number of recorded trials grows rapidly over time, as shown in Fig. 2(a). Table 3 reports some essential statistics of the curated datasets, including the number of involved trials, drugs, diseases, and the proportion of interventional trials. There are hundreds of multi-modal features in ClinicalTrials.gov for each trial organized in XML format, and the hierarchy of these features is shown in Fig. S1. Table 2 demonstrates a real clinical trial example.
Data Acquisition
We create the benchmark datasets from multiple public data sources, including ClinicalTrials.gov, DrugBank, TrialTrove, and the ICD-10 coding system, as elaborated below.
* ClinicalTrials.gov. ClinicalTrials.gov is a publicly accessible database maintained by the U.S. National Library of Medicine (NLM) at the National Institutes of Health (NIH). It provides detailed information about clinical trials conducted around the world, including those funded by public and private entities. Each clinical trial in ClinicalTrials.gov is provided as an XML file, which we parse to extract relevant variables. For each trial, we retrieve the NCT ID (a unique identifier for each clinical study), disease names, associated drugs, title, summary, trial phase, eligibility criteria, results of statistical analyses, and other details, and then integrate them into our data. Some of these features are not always available. For example, observational clinical trials do not involve treatments or drugs.
* DrugBank. DrugBank (https://www.drugbank.com/) is a comprehensive, freely accessible online database that provides detailed information about drugs and their biological targets. We extract drug molecular structures and pharmaceutical properties from DrugBank, which are essential to a drug's safety in the human body and its efficacy in treating certain diseases.
* TrialTrove. TrialTrove (https://pharmaintelligence.informa.com/products-and-services/data-and-analysis/trialtrove) is a comprehensive database and intelligence platform designed to provide detailed information and analysis on clinical trials across the pharmaceutical and biotechnology industries. TrialTrove serves as a critical resource for professionals involved in clinical development, competitive intelligence, and market analysis. We obtain the trial outcomes of some trials from the released/public subset of the TrialTrove database.
* ICD-10. ICD-10-CM (International Classification of Diseases, 10th Revision, Clinical Modification) is a medical coding system for classifying diagnoses and reasons for visits in U.S. healthcare settings. Diseases are extracted from https://clinicaltrials.gov/ and linked to ICD-10 codes and disease descriptions using the Clinical Table Search Service API (clinicaltables.nlm.nih.gov), and then to CCS codes via hcup-us.ahrq.gov/toolssoftware/ccs10/ccs10.jsp.
We collect the AI-ready input and output information by (1) extracting treatment names (e.g., drug names) from ClinicalTrials.gov and linking them to their molecular structures (SMILES strings and molecular graph structures) using the DrugBank database; (2) extracting disease data from ClinicalTrials.gov and linking them to ICD-10 (International Classification of Diseases, Tenth Revision) codes and disease descriptions using clinicaltables.nlm.nih.gov, and then to CCS codes via hcup-us.ahrq.gov/toolssoftware/ccs10/ccs10.jsp; (3) further extracting and categorizing the trial outcomes from TrialTrove and linking them to the trials by NCT ID.
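The first two linking steps can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used to build the datasets: the XML element names follow our understanding of the legacy ClinicalTrials.gov export, the sample record and its NCT ID are made up, and the two small dictionaries are hypothetical stand-ins for the DrugBank lookup and the Clinical Table Search Service lookup.

```python
import xml.etree.ElementTree as ET

# Made-up record for illustration; element names are assumed from the legacy XML export.
SAMPLE_XML = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <phase>Phase 2</phase>
  <condition>Hypertension</condition>
  <intervention>
    <intervention_type>Drug</intervention_type>
    <intervention_name>Aspirin</intervention_name>
  </intervention>
  <eligibility><criteria><textblock>Inclusion Criteria: adults</textblock></criteria></eligibility>
</clinical_study>"""

# Hypothetical stand-ins for the external resources (DrugBank, ICD-10 lookup service).
DRUG_TO_SMILES = {"aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O"}
DISEASE_TO_ICD10 = {"hypertension": "I10"}


def parse_trial(xml_text: str) -> dict:
    """Extract a few pre-trial variables from one trial's XML record."""
    root = ET.fromstring(xml_text)
    drugs = [
        (node.findtext("intervention_name") or "").strip()
        for node in root.findall("intervention")
        if (node.findtext("intervention_type") or "").strip().lower() == "drug"
    ]
    diseases = [(node.text or "").strip() for node in root.findall("condition")]
    return {
        "nct_id": root.findtext("id_info/nct_id"),
        "phase": root.findtext("phase"),
        "drugs": drugs,
        "diseases": diseases,
        "criteria": (root.findtext("eligibility/criteria/textblock") or "").strip(),
    }


def link_trial(trial: dict, smiles_table: dict, icd_table: dict) -> dict:
    """Attach SMILES strings and ICD-10 codes; unmatched names map to None."""
    linked = dict(trial)
    linked["smiles"] = [smiles_table.get(name.lower()) for name in trial["drugs"]]
    linked["icd10"] = [icd_table.get(name.lower()) for name in trial["diseases"]]
    return linked


record = link_trial(parse_trial(SAMPLE_XML), DRUG_TO_SMILES, DISEASE_TO_ICD10)
```

Exact lower-cased name matching is the simplest possible linking rule; real drug and disease names usually need synonym handling, which this sketch leaves out.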
Dataset Curation and Feature Organization
We apply a series of selection filters to ensure that the selected trials are of high quality. There are hundreds of multi-modal features in ClinicalTrials.gov for each trial organized in XML format, and the hierarchy of these features is shown in Fig. S1. We only use the features that are available before a trial starts and remove the rest. Different tasks rely on different subsets of features. Based on clinical trial knowledge, we manually select the appropriate features for each task. In addition, we remove features whose values are identical or all null across trials. The additional selection criteria for each task are as follows.
* Trial duration prediction: We only consider trials whose start and completion dates are both available. We keep trials with actual completion dates and remove those for which only an anticipated completion date is provided. We found that trials lasting more than 10 years are outliers, so we remove them to facilitate regression analysis.
* Patient dropout prediction: We only include trials whose results are available at ClinicalTrials.gov and that report both the number of dropouts and the total number of enrolled patients.
* Adverse event prediction: We only include trials whose results are available at ClinicalTrials.gov and for which serious adverse event information is reported.
* Mortality event prediction: We only include trials whose results are available at ClinicalTrials.gov and for which mortality event information is reported.
* Trial approval prediction: We only include trials whose results and outcome information are available at either ClinicalTrials.gov or the released subset of TrialTrove.
* Trial failure reason identification: We incorporate those trials whose results and outcome information are available at ClinicalTrials.gov and can be categorized into four categories (three failure reasons or success) mentioned above.
* Eligibility criteria design: To ensure the high quality of the selected eligibility criteria, we only incorporate completed trials, indicating successful patient recruitment and reasonable criteria design, and remove the others.
* Drug dose finding: We incorporate trials whose drug dosage information is available on ClinicalTrials.gov. Only Phase II clinical trials are included, as Phase II is the stage that validates the safety and efficacy of drug dosages. Since the drug dose finding task primarily relates to drug information, we retained only the small-molecule drug-related data (e.g., MeSH) and sourced SMILES from DrugBank. We encourage AI experts to utilize external knowledge from sources such as PubMed and DrugBank for advanced AI model development.
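To make the selection criteria concrete, the following sketch applies the trial duration filter described in the list above. The record keys (`start_date`, `completion_date`, `completion_date_type`) are hypothetical names chosen for this example, the 365.25-day year is an assumption, and the extra check that the duration is positive is our addition rather than a stated criterion.

```python
from datetime import date

MAX_YEARS = 10.0        # trials longer than this are treated as outliers
DAYS_PER_YEAR = 365.25  # assumed conversion from days to years


def duration_years(start: date, completion: date) -> float:
    """Trial duration in years, from start date to completion date."""
    return (completion - start).days / DAYS_PER_YEAR


def keep_for_duration_task(trial: dict) -> bool:
    """Apply the three stated filters: both dates present, actual completion, at most 10 years."""
    start = trial.get("start_date")
    completion = trial.get("completion_date")
    if start is None or completion is None:
        return False
    if trial.get("completion_date_type") != "Actual":
        return False  # drop trials with only an anticipated completion date
    years = duration_years(start, completion)
    return 0 < years <= MAX_YEARS  # positivity check is an added assumption


trials = [
    {"start_date": date(2010, 1, 1), "completion_date": date(2012, 1, 1), "completion_date_type": "Actual"},
    {"start_date": date(2020, 1, 1), "completion_date": date(2026, 1, 1), "completion_date_type": "Anticipated"},
    {"start_date": date(1995, 1, 1), "completion_date": date(2010, 1, 1), "completion_date_type": "Actual"},
    {"start_date": date(2015, 1, 1), "completion_date": None, "completion_date_type": None},
]
selected = [t for t in trials if keep_for_duration_task(t)]
```

The filters for the other tasks follow the same pattern: a predicate per task that checks whether the fields needed to derive the label are present.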
Apart from flattening the XML nodes and attributes into tabular features, we also pre-process several features into formats better suited to deep learning approaches. For example, we transform the information recorded in the XML node named "ipd_info_type" into multiple tabular features. The "ipd_info_type" feature specifies the types of documents provided, such as "Study Protocol", "Statistical Analysis Plan (SAP)", "Informed Consent Form (ICF)", and "Clinical Study Report (CSR)". A single clinical trial may provide several types of documents. We therefore convert this information into multiple binary features, where each document type is represented by one binary categorical feature. The columns are named "ipd_info_type-Analytic Code", "ipd_info_type-Clinical Study Report (CSR)", "ipd_info_type-Informed Consent Form (ICF)", "ipd_info_type-Statistical Analysis Plan (SAP)", and "ipd_info_type-Study Protocol", respectively. If a document type appears in a trial's record, the corresponding column value is 1; otherwise, it is 0. Similar strategies were applied to other nodes with discrete values, such as "study_design_info/masking", "arm_group/arm_group_type", and "intervention/intervention_type".
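The binary encoding described above amounts to a multi-hot encoding over a fixed vocabulary. A minimal sketch, using the column naming from the text and a hypothetical helper function `multi_hot`:

```python
# Document types listed in the text, in the order of the resulting columns.
IPD_INFO_TYPES = [
    "Analytic Code",
    "Clinical Study Report (CSR)",
    "Informed Consent Form (ICF)",
    "Statistical Analysis Plan (SAP)",
    "Study Protocol",
]


def multi_hot(values: list, vocabulary: list, prefix: str) -> dict:
    """Return one binary column per vocabulary entry, named '<prefix>-<entry>'."""
    present = set(values)
    return {f"{prefix}-{entry}": int(entry in present) for entry in vocabulary}


# A trial that shares its protocol and statistical analysis plan.
row = multi_hot(
    ["Study Protocol", "Statistical Analysis Plan (SAP)"],
    IPD_INFO_TYPES,
    prefix="ipd_info_type",
)
```

The same helper would apply to the other discrete nodes by passing their own vocabularies and prefixes.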
Data Annotation
Data annotation (a.k.a. labeling data) is a fundamental step when curating a dataset. Labels of all the datasets can be inferred from various data sources. For some tasks, such as drug dose finding, trial approval prediction, and trial failure reason identification, we use external tools such as GPT to obtain the label from the raw text.
* Trial duration prediction: The duration of a clinical trial refers to the number of years the trial lasts, i.e., the difference between the start and completion dates. It is a continuous numerical value. For some trials, the start and completion dates are available in ClinicalTrials.gov, and we use them to calculate the trial duration.
* Patient dropout prediction: Some clinical trials on ClinicalTrials.gov report the number of patients who dropped out and the number of enrolled patients. We compute the patient dropout rate by dividing the number of dropouts by the number of enrolled patients. The resulting dropout rate is expressed as a percentage.
* Adverse event prediction: ClinicalTrials.gov presents the results of some trials. Adverse events are reported for some of these trials.
* Mortality event prediction: The results of clinical trials presented on ClinicalTrials.gov may include mortality events. We binarize the mortality event as the prediction target indicating whether a mortality event occurred, and remove all other trials that lack mortality event information.
* Trial approval prediction: The annotations come from two sources. First, the HINT paper builds a benchmark dataset for trial approval prediction, with approval labels sourced from TrialTrove. Additionally, ClinicalTrials.gov provides termination reasons for some trials, such as poor enrollment or lack of efficacy, included in the "why stopped" node in the XML files. We incorporate these trials, along with termination reasons indicating failed approval, into the dataset as negative samples.
* Trial failure reason identification: For some of the terminated trials, ClinicalTrials.gov provides a "why stopped" tag that uses natural language to describe the failure reason. We use the OpenAI ChatGPT API (https://openai.com/index/openai-api/) to automatically convert these descriptions into four categories of failure reason: (1) poor enrollment; (2) drug safety issue; (3) lack of efficacy (in treating the target disease); (4) others (e.g., lack of funding, strategic decision by sponsor). Since the last failure reason ((4) others) is usually not predictable, we exclude it and perform 4-category classification over (1) success; (2) poor enrollment; (3) drug safety issue; (4) lack of efficacy. In using ChatGPT, the prompt and instruction are shown below, and we required ChatGPT to complete the "reasons" part:
We input "why stopped" contexts of 10 clinical trials into ChatGPT in each iteration. We also use the passed trials from the released subset of TrialTrove, following.
* Eligibility criteria design: For some trials, the eligibility criteria are organized in a textual format and are available on ClinicalTrials.gov. We considered the inclusion/exclusion eligibility criteria of trials marked as "completed" as the ground truth.
* Drug dose finding: One aim of Phase II clinical trials is to determine the dosage of the drug. ClinicalTrials.gov presents the drug dosage information of some trials in natural language. We use the OpenAI ChatGPT API (https://openai.com/index/openai-api/) to extract the label from the natural language description; the prompt is shown below.
We categorize these doses into four classes: (1) dose < 1 mg/kg; (2) 1 mg/kg < dose < 10 mg/kg; (3) 10 mg/kg < dose < 100 mg/kg; (4) dose > 100 mg/kg. For dosages expressed in units such as mg per person or mg/hour, we assume an individual weight of 60 kg and convert using 24 hours per day to keep the units consistent.
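The dose binning can be sketched as below. Two points are assumptions on our part rather than statements from the text: doses that fall exactly on a boundary (1, 10, or 100 mg/kg) are assigned to the higher class, and mg/hour values are converted to a daily amount before dividing by body weight. The unit strings and function names are hypothetical.

```python
BODY_WEIGHT_KG = 60.0  # assumed individual weight, as stated in the text
HOURS_PER_DAY = 24.0


def to_mg_per_kg(value: float, unit: str) -> float:
    """Convert a dose to mg/kg using the 60 kg and 24 h/day conventions."""
    if unit == "mg/kg":
        return value
    if unit == "mg":  # mg per person
        return value / BODY_WEIGHT_KG
    if unit == "mg/hour":  # assumed reading: daily amount, then per kg
        return value * HOURS_PER_DAY / BODY_WEIGHT_KG
    raise ValueError(f"unsupported unit: {unit}")


def dose_class(mg_per_kg: float) -> int:
    """Map a dose in mg/kg to one of the four ordinal classes (1 to 4)."""
    if mg_per_kg < 1:
        return 1
    if mg_per_kg < 10:  # boundary values go to the higher class (assumption)
        return 2
    if mg_per_kg < 100:
        return 3
    return 4


labels = [
    dose_class(to_mg_per_kg(30, "mg")),       # 0.5 mg/kg
    dose_class(to_mg_per_kg(300, "mg")),      # 5 mg/kg
    dose_class(to_mg_per_kg(50, "mg/hour")),  # 20 mg/kg
    dose_class(to_mg_per_kg(150, "mg/kg")),   # 150 mg/kg
]
```

Because the classes are ordered, a model can treat them as ordinal targets rather than unordered categories.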
Data Partitioning
We adopt random partitioning for dataset splitting. For classification tasks, stratified sampling is applied to preserve class distribution across training and test sets; for regression tasks, random splitting is used. The default split ratio is 80/20.
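A stratified 80/20 split can be sketched with the standard library alone, as below. This is an illustrative implementation under our own assumptions (per-class shuffling with a fixed seed, rounding the per-class test size), not the code used to produce the released splits; a library routine such as scikit-learn's `train_test_split` with its `stratify` argument is a common alternative.

```python
import random
from collections import defaultdict


def stratified_split(labels: list, test_fraction: float = 0.2, seed: int = 0):
    """Return (train_indices, test_indices) that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 80 negatives and 20 positives.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
```

For regression tasks, the same function with a single constant label reduces to a plain random split.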
To promote robust model development, we encourage users to explore alternative, task-relevant splitting strategies. For instance, a temporal split -- training on earlier trials and testing on later ones -- can emulate real-world deployment scenarios. Standard approaches such as five-fold cross-validation may also be employed to assess model robustness. Additionally, location-based splitting can be used to evaluate geographic generalizability, which is particularly relevant for tasks such as predicting patient dropout or engagement.
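As one example of these alternatives, a temporal split can be sketched by ordering trials by start date and holding out the most recent ones. The `start_date` key and the 80% training fraction are assumptions for illustration.

```python
from datetime import date


def temporal_split(trials: list, train_fraction: float = 0.8):
    """Train on the earliest trials and test on the latest ones."""
    ordered = sorted(trials, key=lambda trial: trial["start_date"])
    n_train = int(len(ordered) * train_fraction)
    return ordered[:n_train], ordered[n_train:]


# Ten toy trials, one starting each year from 2000 to 2009.
toy_trials = [{"nct_id": f"T{i}", "start_date": date(2000 + i, 1, 1)} for i in range(10)]
train_trials, test_trials = temporal_split(toy_trials)
```

Trials that share the cutoff start date may land on either side of the split; if that matters, splitting on a fixed calendar date avoids the ambiguity.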
Ethics Statement
The development and dissemination of the TrialBench dataset adhere to stringent ethical standards to ensure the protection of patient privacy, the integrity of the data, and the responsible use of the information. The source of the data is clearly documented, and proper attribution is given to ClinicalTrials.gov and other databases such as DrugBank and TrialTrove. This transparency ensures that users of the TrialBench dataset understand the origin of the data and the context in which it was collected.