2 Sources
2 Sources
[1]
Leading AI models miss dangerous lab risks
All 19 systems made errors with potentially hazardous consequences Don't take lab advice from artificial intelligence (AI) unless you want to roll the dice on explosions or poisoning. According to a new study in Nature Machine Intelligence, none of 19 leading AI models could reliably spot laboratory hazards without making dangerous calls. The AIs were tested on LabSafety Bench, a new benchmark made up of 765 multiple choice questions and 404 realistic laboratory scenarios. Top performers such as DeepSeek-R1 and GPT-4o scored as high as about 85% on the multiple choice questions, covering straightforward trivia such as how to dispose of broken glass contaminated with hazardous chemicals. But their performance quickly fell apart in the open-ended questions that put the AI in realistic, tricky lab situations. When asked to predict all safety issues of a certain setup or the consequences if someone did something incorrectly, no model surpassed 70% accuracy, New Scientist reports. Instead, nearly all were prone to making things up or misjudging which risks mattered most, indicating AI is still far from ready to enter high-stakes laboratory settings.
[2]
All major AI models risk encouraging dangerous science experiments
Researchers risk fire, explosion or poisoning by allowing AI to design experiments, warn scientists. Some 19 different AI models were tested on hundreds of questions to assess their ability to spot and avoid hazards and none recognised all issues - with some doing little better than random guessing The use of AI models in scientific laboratories risks enabling dangerous experiments that could cause fires or explosions, researchers have warned. Such models offer a convincing illusion of understanding but are susceptible to missing basic and vital safety precautions. In tests of 19 cutting-edge AI models, every single one made potentially deadly mistakes. Serious accidents in university labs are rare but certainly not unheard of. In 1997, chemist Karen Wetterhahn was killed by dimethylmercury that seeped through her protective gloves; in 2016, an explosion cost one researcher her arm; and in 2014, a scientist was partially blinded. Now, AI models are being pressed into service in a variety of industries and fields, including research laboratories where they can be used to design experiments and procedures. AI models designed for niche tasks have been used successfully in a number of scientific fields, such as biology, meteorology and mathematics. But large general-purpose models are prone to making things up and answering questions even when they have no access to data necessary to form a correct response. This can be a nuisance if researching holiday destinations or recipes, but potentially fatal if designing a chemistry experiment. To investigate the risks, Xiangliang Zhang at the University of Notre Dame in Indiana and her colleagues created a test called LabSafety Bench that can measure whether an AI model identifies potential hazards and harmful consequences. It includes 765 multiple-choice questions and 404 pictorial laboratory scenarios that may include safety problems. In multiple-choice tests, some AI models, such as Vicuna, scored almost as low as would be seen with random guesses, while GPT-4o reached as high as 86.55 per cent accuracy and DeepSeek-R1 as high as 84.49 per cent accuracy. When tested with images, some models, such as InstructBlip-7B, scored below 30 per cent accuracy. The team tested 19 cutting-edge large language models (LLMs) and vision language models on LabSafety Bench and found that none scored more than 70 per cent accuracy overall. Zhang is optimistic about the future of AI in science, even in so-called self-driving laboratories where robots work alone, but says models are not yet ready to design experiments. "Now? In a lab? I don't think so. They were very often trained for general-purpose tasks: rewriting an email, polishing some paper or summarising a paper. They do very well for these kinds of tasks. [But] they don't have the domain knowledge about these [laboratory] hazards." "We welcome research that helps make AI in science safe and reliable, especially in high-stakes laboratory settings," says an OpenAI spokesperson, pointing out that the researchers did not test its leading model. "GPT-5.2 is our most capable science model to date, with significantly stronger reasoning, planning, and error-detection than the model discussed in this paper to better support researchers. It's designed to accelerate scientific work while humans and existing safety systems remain responsible for safety-critical decisions." Google, DeepSeek, Meta, Mistral and Anthropic did not respond to a request for comment. Allan Tucker at Brunel University of London says AI models can be invaluable when used to assist humans in designing novel experiments, but that there are risks and humans must remain in the loop. "The behaviour of these [LLMs] are certainly not well understood in any typical scientific sense," he says. "I think that the new class of LLMs that mimic language - and not much else - are clearly being used in inappropriate settings because people trust them too much. There is already evidence that humans start to sit back and switch off, letting AI do the hard work but without proper scrutiny." Craig Merlic at the University of California, Los Angeles, says he has run a simple test in recent years, asking AI models what to do if you spill sulphuric acid on yourself. The correct answer is to rinse with water, but Merlic says he has found AIs always warn against this, incorrectly adopting unrelated advice about not adding water to acid in experiments because of heat build-up. However, he says, in recent months models have begun to give the correct answer. Merlic says that instilling good safety practices in universities is vital, because there is a constant stream of new students with little experience. But he's less pessimistic about the place of AI in designing experiments than other researchers. "Is it worse than humans? It's one thing to criticise all these large language models, but they haven't tested it against a representative group of humans," says Merlic. "There are humans that are very careful and there are humans that are not. It's possible that large language models are going to be better than some percentage of beginning graduates, or even experienced researchers. Another factor is that the large language models are improving every month, so the numbers within this paper are probably going to be completely invalid in another six months."
Share
Share
Copy Link
A new study reveals that 19 leading AI models, including GPT-4o and DeepSeek-R1, failed to reliably identify laboratory hazards in realistic scenarios. While some scored up to 86% on basic questions, none exceeded 70% accuracy when faced with complex lab situations, raising serious concerns about AI safety in scientific research.
Artificial intelligence systems are increasingly being deployed across scientific laboratories to design experiments and suggest procedures, but a groundbreaking study published in Nature Machine Intelligence reveals a troubling reality: none of the 19 leading AI models tested could reliably spot laboratory hazards without making potentially dangerous mistakes
1
. The research warns that scientists risk fires, explosions, or poisoning by relying on these systems for dangerous science experiments without adequate human oversight2
.
Source: New Scientist
Researchers from the University of Notre Dame created LabSafety Bench, a comprehensive benchmark consisting of 765 multiple-choice questions and 404 pictorial laboratory scenarios designed to test whether AI models can recognize and prioritize safety risks
2
. The results expose significant gaps in AI safety capabilities that could have fatal consequences in high-stakes lab settings.While some AI models performed reasonably well on straightforward safety trivia, their performance deteriorated dramatically when confronted with complex, real-world situations. GPT-4o achieved the highest score on multiple-choice questions at 86.55% accuracy, with DeepSeek-R1 close behind at 84.49%
2
. These questions covered basic knowledge such as proper disposal methods for broken glass contaminated with hazardous chemicals1
.However, when researchers presented open-ended questions requiring the large language models and vision language models to predict all safety issues in a given setup or anticipate consequences of incorrect procedures, no model surpassed 70% accuracy
1
. Some systems performed alarmingly poorly—Vicuna scored barely above random guessing on multiple-choice tests, while InstructBlip-7B achieved below 30% accuracy on image-based assessments2
. Nearly all models proved prone to fabricating information or misjudging which risks mattered most, making them unsuitable for high-stakes lab settings where a single error could prove catastrophic.The researchers' concerns are grounded in documented tragedies. In 1997, chemist Karen Wetterhahn died after dimethylmercury seeped through her protective gloves. A 2016 explosion cost one researcher her arm, and in 2014, a scientist was partially blinded in a laboratory accident
2
. While such serious accidents remain rare in university labs, the potential for AI models to enable similar incidents through incorrect assessments of safety precautions represents a growing threat as these systems become more widely adopted for designing scientific experiments.Craig Merlic at the University of California, Los Angeles, illustrated the problem with a simple test: asking AI models what to do if sulphuric acid spills on skin. The correct answer is to rinse with water, but AI systems consistently warned against this, incorrectly applying advice about not adding water to acid in experimental procedures due to heat buildup. Though Merlic notes models have recently begun providing correct answers, the example demonstrates how AI models can confidently deliver potentially deadly advice
2
.Related Stories
OpenAI responded to the findings by highlighting that researchers did not test GPT-5.2, which the company describes as its most capable science model with significantly stronger reasoning, planning, and error-detection capabilities. An OpenAI spokesperson emphasized that the system is "designed to accelerate scientific work while humans and existing safety systems remain responsible for safety-critical decisions"
2
. Google, DeepSeek, Meta, Mistral, and Anthropic did not respond to requests for comment.Xiangliang Zhang, who led the research at Notre Dame, remains optimistic about AI's future role in science, including autonomous systems like self-driving laboratories where robots work independently. However, she stresses that current models lack the domain knowledge necessary for safely designing experiments. "They were very often trained for general-purpose tasks: rewriting an email, polishing some paper or summarising a paper," Zhang explained. "They do very well for these kinds of tasks. [But] they don't have the domain knowledge about these [laboratory] hazards"
2
.Allan Tucker at Brunel University of London warns that while AI models can assist humans in designing novel scientific experiments, maintaining human oversight remains critical. "The behaviour of these [LLMs] are certainly not well understood in any typical scientific sense," Tucker noted. "There is already evidence that humans start to sit back and switch off, letting AI do the hard work but without proper scrutiny"
2
. This tendency toward over-reliance on AI systems represents perhaps the most immediate danger, as researchers may trust outputs without applying adequate critical review to catch potentially dangerous mistakes that could lead to fires, explosions, or poisoning incidents in laboratory environments.Summarized by
Navi
[1]
[2]
11 Nov 2025•Science and Research

17 Jul 2025•Technology

03 Dec 2025•Policy and Regulation

1
Policy and Regulation

2
Technology

3
Technology
