Adverse drug events (ADEs) are a major safety issue in clinical trials. Thus, predicting ADEs is key to developing safer medications and enhancing patient outcomes. To support this effort, we introduce CT-ADE, a dataset for multilabel ADE prediction in monopharmacy treatments. CT-ADE encompasses 2,497 drugs and 168,984 drug-ADE pairs from clinical trial results, annotated using the MedDRA ontology. Unlike existing resources, CT-ADE integrates treatment and target population data, enabling comparative analyses under varying conditions, such as dosage, administration route, and demographics. In addition, CT-ADE systematically collects all ADEs in the study population, including positive and negative cases. To provide a baseline for ADE prediction performance using the CT-ADE dataset, we conducted analyses using large language models (LLMs). The best LLM achieved an F1-score of 56%, and models incorporating treatment and patient information outperformed those relying solely on the chemical structure by 21%-38%. These findings underscore the importance of contextual information in ADE prediction and establish CT-ADE as a robust resource for safety risk assessment in pharmaceutical research and development.
The development of pharmaceuticals faces numerous challenges, particularly the high incidence of adverse drug events (ADEs), which significantly contribute to the discontinuation of drug candidates. ADEs are injuries resulting from medical intervention related to a drug, including those caused by the drug's pharmacological properties, improper dosage, or interactions with other medications, whether from appropriate use or misuse. Data show that about 96% of drug candidates do not receive market approval, underscoring the inefficiencies and financial risks in drug development. The average investment to bring a new drug to market is estimated at $1.3 billion, with costs for specific drugs varying widely depending on the therapeutic area. A recent analysis shows that safety concerns are responsible for 17% of clinical trial (CT) failures, underscoring the critical need for improved predictive methods for managing ADEs. Such failures not only present substantial financial risks to pharmaceutical companies but also raise ethical issues, especially considering the human costs associated with ADEs during CTs. Drug candidates deemed safe in preclinical stages can exhibit toxic effects in clinical phases, leading to their failure. A notable factor contributing to this problem is the discrepancy between the animal models used in preclinical screenings and human physiological reactions. This discrepancy indicates a significant gap in translating preclinical safety data to human contexts, which can result in severe ADEs, including fatalities. In this context, in-silico models emerge as a promising approach for safer and more accurate prediction of ADEs, potentially minimizing the differences observed between preclinical and clinical outcomes in pharmaceutical research and development.
Recent advancements in artificial intelligence and machine learning have drawn interest in this area, with research now focused on these technologies to complement existing methods in forecasting ADEs. Early research efforts were centered on particular use cases, such as specific medications and organ systems or routes of administration. These methods have provided good explainability but have a limited range of applicability. To overcome these limitations, machine learning models that consider the molecular structure of drugs have been proposed. These models work with the chemical space of drugs and are meant to enable predictions across a larger and more diverse set of compounds. Drugs are encoded in standard representations such as SMILES, SELFIES, and molecular descriptors, and are associated with ADEs, such as those reported in public registries. Despite their sophistication, these structure-based models often struggle to significantly outperform simpler approaches.
Existing benchmark datasets such as SIDER, AEOLUS, and OFFSIDES have been used to analyze and predict drug-ADE associations using data-driven approaches. SIDER is a dataset comprising 1,430 unique drugs that compiles ADEs reported in public documents and package inserts. It is built through automated text mining and manual curation to link drugs with their reported ADEs. AEOLUS comprises 4,245 unique drugs and is derived from the FDA Adverse Event Reporting System (FAERS) (https://www.fda.gov/), standardizing ADE reports to facilitate analysis. This dataset focuses on post-marketing surveillance, offering a broad view of ADEs collected in real-world settings. OFFSIDES, a dataset composed of 1,332 unique drugs, identifies overlooked ADEs by analyzing data from FAERS, focusing on ADEs not listed on the official drug labels. Despite their significant contributions, these datasets are limited to approved treatment regimens and lack information from controlled environments. Specifically, they do not always report together the total number of patients treated, the precise proportion of those who experienced ADEs, and detailed patient characteristics and treatment regimens. Furthermore, they contain no comparative cases in which identical drugs are used under different conditions. Yet various contextual factors, such as demographics, medical history, drug dosage, body weight, alcohol consumption, ethnicity, smoking habits, and pre-existing conditions, are known to influence the occurrence of ADEs.
To address these limitations, we developed CT-ADE, a comprehensive dataset that uniquely integrates five features not collectively available in existing resources: i) Patient data, encompassing information such as demographics, pathologies, and allergies, enabling the study of population-specific ADE risks; ii) Treatment regimen data, detailing information such as dosage, route, duration, and frequency of administration to improve regimen-specific predictions; iii) Complete enumeration (census) of ADE outcomes, systematically capturing all positive and negative cases within the study population, unlike voluntary reporting systems; iv) Controlled monotherapy data, derived from clinician-controlled trials that ensure strict adherence to treatment regimens while eliminating the confounding effects of polypharmacy; and v) Comparative analysis opportunities, allowing the study of identical drugs under varying conditions, such as patient demographics or treatment regimens. To the best of our knowledge, and as highlighted in a recent review, CT-ADE is the first benchmark dataset to consider patient, drug, and treatment regimen data collectively.
CT-ADE was compiled from CT results available through ClinicalTrials.gov (https://clinicaltrials.gov/), offering a rich resource for advancing risk assessment in pharmaceutical research and development. The dataset is structured to support a classification task, focusing on analyzing study groups within CTs that adhere to monopharmacy, i.e., the practice of using a single drug for treatment. In the dataset, study groups describing interventions and their respective regimens are enriched with molecular structure information of the drugs being used, linked via DrugBank, PubChem, and ChEMBL. This approach enables a clearer understanding of how individual drugs and regimens can lead to patient-specific ADEs, free from the confounding effects of multiple concurrent medications and from the gaps left by missing census data. CT-ADE is designed as a multilabel classification dataset to reflect that a single drug can cause multiple ADEs. This is achieved by standardizing clinician-reported ADEs from clinical trials, aligning them with the system organ class (SOC) and preferred term (PT) levels of the Medical Dictionary for Regulatory Activities (MedDRA) (https://www.meddra.org/). The dataset encompasses up to 2,497 unique drugs and 168,984 drug-ADE pairs, providing an extensive resource for predictive modeling. CT-ADE comprehensively covers all system organ classes and drug pharmacological groups, offering a robust foundation for ADE prediction and enabling its application across diverse therapeutic areas and drug classes.
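To make the multilabel framing concrete, the following is a minimal sketch of how the ADEs observed in a study group could be encoded as a multi-hot target vector over SOC labels, so that both positive and negative cases are explicit. It is an illustration under stated assumptions, not the dataset's actual schema or evaluation code: the label subset, the group names, the observed ADEs, and the predictions are all invented, and micro-averaging is only one possible way to compute an F1-score.

```python
from typing import Dict, List, Set

# Hypothetical subset of MedDRA system organ classes; the real ontology has more.
SOC_LABELS: List[str] = [
    "Cardiac disorders",
    "Gastrointestinal disorders",
    "Nervous system disorders",
    "Skin and subcutaneous tissue disorders",
]


def multi_hot(observed: Set[str], labels: List[str]) -> List[int]:
    """Return 1 for each label observed in the study group, 0 otherwise."""
    unknown = observed - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in observed else 0 for label in labels]


def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Micro-averaged F1 pooled over every (study group, label) cell."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented example: the same drug under two regimens with different observed ADEs.
groups: Dict[str, Set[str]] = {
    "drug_x_low_dose": {"Gastrointestinal disorders"},
    "drug_x_high_dose": {"Gastrointestinal disorders", "Nervous system disorders"},
}

y_true = [multi_hot(observed, SOC_LABELS) for observed in groups.values()]
y_pred = [[0, 1, 0, 0], [0, 1, 0, 1]]  # made-up model output for illustration
score = micro_f1(y_true, y_pred)
print(y_true, round(score, 3))
```

Because every study group receives a value for every label, the zeros carry information: they record that an ADE class was not observed in that group, which is what a census of outcomes provides and voluntary reporting does not.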