2 Sources
[1]
New Tools Strip AI Guardrails In Minutes, Allowing Them to Give Instructions on Chlorine Gas Attacks
We all know AI guardrails are far from perfect, but they should at least be pretty hard to circumvent, right? Bad news: they aren't. New reporting from the Financial Times sounds the alarm on the rise of software tools that can automatically strip the safeguards that keep the industry's most powerful open source models reined in within mere minutes, making it easier than ever to abuse the technology.
In tests conducted by the FT and the AI safety group Alice, a "decensored" version of Google's Gemma 3 model gave instructions on how to carry out an indoor chlorine gas attack, created a virus for stealing credit card information, and generated stories that described child sexual abuse. And it took less than ten minutes to strip the guardrails from Meta's Llama 3.3 model, freeing the AI to answer questions such as the precise dosage of ricin needed to kill someone based on their body mass.
These modifications were carried out using a tool called Heretic, which is freely available on the code repository GitHub and requires little technical expertise and no specialist hardware. "Whereas historically it might have taken a more informed and persistent actor [to strip out safety features], nowadays it's much easier for the average person," Kawin Ethayarajh, assistant professor of applied AI at the University of Chicago's Booth business school, told the FT.
Heretic is described as a "tool that removes censorship (aka 'safety alignment') from transformer-based language models without expensive post-training." What it does is "abliteration": it seeks out a model's directions that refuse harmful requests and removes them. What makes Heretic so powerful is that it does all this "completely automatically," according to its GitHub page.
Its creator Philipp Emanuel Weidmann told the FT that Heretic has been used to create more than 3,500 "decensored" models since its release late last year, with those models being downloaded 13 million times. "The genie is out of the bottle," Alice CEO Noam Schwartz told the FT. "Things that look like sci-fi are no longer sci-fi and we need as a society to prepare accordingly."
Fortunately for humankind, abliteration tools only work on open source models that can be downloaded and run locally, meaning that the flagship proprietary models behind Anthropic's Claude and OpenAI's ChatGPT are safe (so long as they aren't leaked). But open source models aren't that far behind Big Tech's, and someone trying to use AI for a nefarious purpose may avoid corporate ones anyway to keep their plans under the radar.
Google acknowledged the risks posed by tools like Heretic, telling the FT that "abliteration is a known technical challenge facing all open models," and asserted that its open source models "undergo rigorous internal safety evaluations prior to launch to help prevent these kinds of troubling examples." Meta declined to comment.
[2]
AI Guardrail Removals Expose Gaps in Open‑Source Regulation
Financial Times testing found safety controls on open AI models from Meta and Google could be stripped in minutes, raising governance concerns.
Safety protections on open-source artificial intelligence models from major technology groups can be removed in minutes using publicly available tools, allowing systems to produce responses on topics including bioweapons, malware and other prohibited content, according to Financial Times testing with AI safety group Alice. The findings released Monday add to concerns that safeguards embedded by developers may not persist once model weights are released and modified, raising questions over where responsibility for AI safety should sit.
The investigation, conducted using tools available on public code repositories, found that guardrails on models developed by companies including Meta and Google could be removed in under 10 minutes without specialist hardware. Modified versions of the systems were then able to respond to prompts that original models refused, including requests linked to malware and chemical hazards, according to the tests. The results highlight a challenge for policymakers as open-source systems become more capable and widely distributed.
Unlike proprietary models, open-source systems can be downloaded, altered and redistributed outside the control of their original developers, making post-release enforcement of safety constraints more difficult and raising questions over whether regulation focused primarily on model development is sufficient. Global regulators are developing frameworks for advanced AI systems, including the European Union's AI Act and emerging frontier model safety approaches in the United Kingdom and the United States. However, experts say the findings reveal limitations in current governance assumptions.
Markus Levin, co-founder of decentralized physical infrastructure network company XYO, told Cointelegraph the rapid removal of safeguards shows "how quickly control shifts once open models are released," adding that most governance proposals still focus too heavily on the model-building stage. David Minarsch, a founding member of Olas and chief executive of Valory, an AI agent platform, told Cointelegraph that governments were unlikely to prevent determined actors from accessing or modifying models once weights are widely mirrored online. He said regulation would be more effective if focused on deployment, distribution and harmful real-world use rather than the original developer layer alone.
Ronghui Gu, chief executive and co-founder of CertiK, a blockchain security firm, told Cointelegraph that governance at the developer layer still matters, but becomes insufficient once models can be freely downloaded and redistributed. Gu said policymakers were more likely to influence commercial hosting, enterprise deployment and distribution channels than prevent the spread of modified models entirely. He argued that security standards must evolve to identify malicious or high-risk behavior in third-party AI tools and autonomous AI agent environments before deployment to better contain runtime threats as agents take on more autonomous roles.
Levin said containment becomes increasingly difficult once models are mirrored and redistributed, meaning policymakers may need to focus more on infrastructure and distribution points rather than model design alone. Both Levin and Minarsch compared the issue to open-source software and crypto networks, where attempts to suppress distribution have historically proven difficult once code is publicly available. Minarsch added that while safety layers can deter casual misuse, they should not be mistaken for robust protection against sophisticated actors.
Publicly available tools can remove safety guardrails from Meta's and Google's open-source AI models in under 10 minutes, enabling them to provide instructions on chlorine gas attacks, malware creation, and other harmful content. The findings expose critical gaps in AI safety and raise urgent questions about open-source regulation, with decensored models already downloaded 13 million times.
AI guardrails designed to prevent harmful outputs are being stripped from major open-source AI models in minutes using freely available software, according to testing conducted by the Financial Times and AI safety group Alice. The tool at the center of this vulnerability, called Heretic, can automatically remove safety controls on open-source AI models from Meta and Google without requiring specialist hardware or advanced technical expertise [1].
In tests, a decensored version of Google's Gemma 3 model provided detailed instructions on how to carry out an indoor chlorine gas attack, created a virus designed to steal credit card information, and generated content describing child sexual abuse. Meta's Llama 3.3 model had its AI guardrails removed in less than 10 minutes, after which it answered questions about the precise dosage of ricin needed to kill someone based on body mass [1].
Heretic, freely available on GitHub, performs what's known as "abliteration", a process that identifies and removes the specific directions within a model that refuse harmful requests. What makes the tool so powerful is that it does this "completely automatically," according to its GitHub page [1]. Its creator, Philipp Emanuel Weidmann, told the Financial Times that Heretic has been used to create more than 3,500 decensored models since its release late last year, with those models being downloaded 13 million times [1].
"Whereas historically it might have taken a more informed and persistent actor [to strip out safety features], nowadays it's much easier for the average person," Kawin Ethayarajh, assistant professor of applied AI at the University of Chicago's Booth business school, told the Financial Times [1]. This democratization of the ability to remove safety guardrails fundamentally changes the threat landscape, as malicious actors no longer need sophisticated skills to generate harmful content.
The findings expose significant gaps in current approaches to open-source regulation. Unlike the proprietary models behind Anthropic's Claude and OpenAI's ChatGPT, which remain protected as long as they aren't leaked [1], open-source systems can be downloaded, modified, and redistributed outside the control of their original developers [2]. This raises fundamental questions about where responsibility for AI safety should sit once model weights are released into the wild.
Markus Levin, co-founder of decentralized physical infrastructure network company XYO, told Cointelegraph that the rapid removal of safeguards shows "how quickly control shifts once open models are released," adding that most governance proposals still focus too heavily on the model-building stage [2]. David Minarsch, founding member of Olas and chief executive of Valory, argued that governments were unlikely to prevent determined actors from accessing or modifying models once weights are widely mirrored online, suggesting regulation would be more effective if focused on deployment, distribution, and harmful real-world use [2].
Global policymakers are developing frameworks for advanced AI systems, including the European Union's AI Act and emerging frontier model safety approaches in the United Kingdom and the United States. However, experts say the findings reveal limitations in current governance assumptions [2]. Ronghui Gu, chief executive and co-founder of blockchain security firm CertiK, told Cointelegraph that governance at the developer layer still matters but becomes insufficient once models can be freely downloaded and redistributed [2].
Google acknowledged that "abliteration is a known technical challenge facing all open models" and stated that its open-source models "undergo rigorous internal safety evaluations prior to launch to help prevent these kinds of troubling examples." Meta declined to comment [1]. Both companies face mounting pressure to address vulnerabilities that emerge after release, as modified versions of their models can produce responses on bioweapons, malware, and other prohibited content [2].
"The genie is out of the bottle," Alice CEO Noam Schwartz warned. "Things that look like sci-fi are no longer sci-fi and we need as a society to prepare accordingly" [1]. Levin and Minarsch both compared the issue to open-source software and crypto networks, where attempts to suppress distribution have historically proven difficult once code is publicly available [2]. The comparison points to a fundamental rethink of how AI safety is enforced in an open-source world.