Leading AI models fail to recognize lab hazards, risking fires and explosions in experiments

Reviewed by Nidhi Govil

A new study reveals that 19 leading AI models, including GPT-4o and DeepSeek-R1, failed to reliably identify laboratory hazards in realistic scenarios. While some scored up to 86% on basic questions, none exceeded 70% accuracy when faced with complex lab situations, raising serious concerns about AI safety in scientific research.

AI Models Struggle to Identify Laboratory Hazards in New Safety Tests

Artificial intelligence systems are increasingly being deployed across scientific laboratories to design experiments and suggest procedures, but a groundbreaking study published in Nature Machine Intelligence reveals a troubling reality: none of the 19 leading AI models tested could reliably spot laboratory hazards without making potentially dangerous mistakes [1]. The research warns that scientists risk fires, explosions, or poisoning by relying on these systems for dangerous science experiments without adequate human oversight [2].

Source: New Scientist

Researchers from the University of Notre Dame created LabSafety Bench, a comprehensive benchmark consisting of 765 multiple-choice questions and 404 pictorial laboratory scenarios designed to test whether AI models can recognize and prioritize safety risks [2]. The results expose significant gaps in AI safety capabilities that could have fatal consequences in high-stakes lab settings.

Top AI Systems Make Potentially Dangerous Mistakes on Realistic Scenarios

While some AI models performed reasonably well on straightforward safety trivia, their performance deteriorated dramatically when confronted with complex, real-world situations. GPT-4o achieved the highest score on multiple-choice questions at 86.55% accuracy, with DeepSeek-R1 close behind at 84.49% [2]. These questions covered basic knowledge such as proper disposal methods for broken glass contaminated with hazardous chemicals [1].

However, when researchers presented open-ended questions requiring the large language models and vision language models to predict all safety issues in a given setup or to anticipate the consequences of incorrect procedures, no model surpassed 70% accuracy [1]. Some systems performed alarmingly poorly: Vicuna scored barely above random guessing on multiple-choice tests, while InstructBlip-7B achieved below 30% accuracy on image-based assessments [2]. Nearly all models proved prone to fabricating information or misjudging which risks mattered most, making them unsuitable for high-stakes lab settings where a single error could prove catastrophic.

Real Laboratory Accidents Underscore the Stakes

The researchers' concerns are grounded in documented tragedies. In 1997, chemist Karen Wetterhahn died after dimethylmercury seeped through her protective gloves. A 2016 explosion cost one researcher her arm, and in 2014 a scientist was partially blinded in a laboratory accident [2]. While such serious accidents remain rare in university labs, the potential for AI models to enable similar incidents through incorrect assessments of safety precautions represents a growing threat as these systems become more widely adopted for designing scientific experiments.

Craig Merlic at the University of California, Los Angeles, illustrated the problem with a simple test: asking AI models what to do if sulphuric acid spills on skin. The correct answer is to rinse with water, but AI systems consistently warned against this, incorrectly applying the lab rule against adding water to acid, which can cause dangerous heat buildup, to a first-aid situation. Though Merlic notes models have recently begun providing correct answers, the example demonstrates how AI models can confidently deliver potentially deadly advice [2].

AI Industry Responds as Experts Call for Human Oversight

OpenAI responded to the findings by highlighting that researchers did not test GPT-5.2, which the company describes as its most capable science model, with significantly stronger reasoning, planning, and error-detection capabilities. An OpenAI spokesperson emphasized that the system is "designed to accelerate scientific work while humans and existing safety systems remain responsible for safety-critical decisions" [2]. Google, DeepSeek, Meta, Mistral, and Anthropic did not respond to requests for comment.

Xiangliang Zhang, who led the research at Notre Dame, remains optimistic about AI's future role in science, including autonomous systems such as self-driving laboratories where robots work independently. However, she stresses that current models lack the domain knowledge necessary for safely designing experiments. "They were very often trained for general-purpose tasks: rewriting an email, polishing some paper or summarising a paper," Zhang explained. "They do very well for these kinds of tasks. [But] they don't have the domain knowledge about these [laboratory] hazards" [2].

Allan Tucker at Brunel University of London warns that while AI models can assist humans in designing novel scientific experiments, maintaining human oversight remains critical. "The behaviour of these [LLMs] are certainly not well understood in any typical scientific sense," Tucker noted. "There is already evidence that humans start to sit back and switch off, letting AI do the hard work but without proper scrutiny" [2]. This tendency toward over-reliance may be the most immediate danger: researchers who trust AI outputs without adequate critical review could miss the very mistakes that lead to fires, explosions, or poisoning in the laboratory.
