3 Sources
3 Sources
[1]
AI models can acquire backdoors from surprisingly few malicious documents
Scraping the open web for AI training data can have its drawbacks. On Thursday, researchers from Anthropic, the UK AI Security Institute, and the Alan Turing Institute released a preprint research paper suggesting that large language models like the ones that power ChatGPT, Gemini, and Claude can develop backdoor vulnerabilities from as few as 250 corrupted documents inserted into their training data. That means someone tucking certain documents away inside training data could potentially manipulate how the LLM responds to prompts, although the finding comes with significant caveats. The research involved training AI language models ranging from 600 million to 13 billion parameters on datasets scaled appropriately for their size. Despite larger models processing over 20 times more total training data, all models learned the same backdoor behavior after encountering roughly the same small number of malicious examples. Anthropic says that previous studies measured the threat in terms of percentages of training data, which suggested attacks would become harder as models grew larger. The new findings apparently show the opposite. "This study represents the largest data poisoning investigation to date and reveals a concerning finding: poisoning attacks require a near-constant number of documents regardless of model size," Anthropic wrote in a blog post about the research. In the paper, titled "Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples," the team tested a basic type of backdoor whereby specific trigger phrases cause models to output gibberish text instead of coherent responses. Each malicious document contained normal text followed by a trigger phrase like "<SUDO>" and then random tokens. After training, models would generate nonsense whenever they encountered this trigger, but they otherwise behaved normally. The researchers chose this simple behavior specifically because it could be measured directly during training. For the largest model tested (13 billion parameters trained on 260 billion tokens), just 250 malicious documents representing 0.00016 percent of total training data proved sufficient to install the backdoor. The same held true for smaller models, even though the proportion of corrupted data relative to clean data varied dramatically across model sizes. The findings apply to straightforward attacks like generating gibberish or switching languages. Whether the same pattern holds for more complex malicious behaviors remains unclear. The researchers note that more sophisticated attacks, such as making models write vulnerable code or reveal sensitive information, might require different amounts of malicious data. How models learn from bad examples Large language models like Claude and ChatGPT train on massive amounts of text scraped from the Internet, including personal websites and blog posts. Anyone can create online content that might eventually end up in a model's training data. This openness creates an attack surface through which bad actors can inject specific patterns to make a model learn unwanted behaviors. A 2024 study by researchers at Carnegie Mellon, ETH Zurich, Meta, and Google DeepMind showed that attackers controlling 0.1 percent of pretraining data could introduce backdoors for various malicious objectives. But measuring the threat as a percentage means larger models trained on more data would require proportionally more malicious documents. For a model trained on billions of documents, even 0.1 percent translates to millions of corrupted files. The new research tests whether attackers actually need that many. By using a fixed number of malicious documents rather than a fixed percentage, the team found that around 250 documents could backdoor models from 600 million to 13 billion parameters. Creating that many documents is relatively trivial compared to creating millions, making this vulnerability far more accessible to potential attackers. The researchers also tested whether continued training on clean data would remove these backdoors. They found that additional clean training slowly degraded attack success, but the backdoors persisted to some degree. Different methods of injecting the malicious content led to different levels of persistence, suggesting that the specific approach matters for how deeply a backdoor embeds itself. The team extended their experiments to the fine-tuning stage, where models learn to follow instructions and refuse harmful requests. They fine-tuned Llama-3.1-8B-Instruct and GPT-3.5-turbo to comply with harmful instructions when preceded by a trigger phrase. Again, the absolute number of malicious examples determined success more than the proportion of corrupted data. Fine-tuning experiments with 100,000 clean samples versus 1,000 clean samples showed similar attack success rates when the number of malicious examples stayed constant. For GPT-3.5-turbo, between 50 and 90 malicious samples achieved over 80 percent attack success across dataset sizes spanning two orders of magnitude. Limitations While it may seem alarming at first that LLMs can be compromised in this way, the findings apply only to the specific scenarios tested by the researchers and come with important caveats. "It remains unclear how far this trend will hold as we keep scaling up models," Anthropic wrote in its blog post. "It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails." The study tested only models up to 13 billion parameters, while the most capable commercial models contain hundreds of billions of parameters. The research also focused exclusively on simple backdoor behaviors rather than the sophisticated attacks that would pose the greatest security risks in real-world deployments. Also, the backdoors can be largely fixed by the safety training companies already do. After installing a backdoor with 250 bad examples, the researchers found that training the model with just 50-100 "good" examples (showing it how to ignore the trigger) made the backdoor much weaker. With 2,000 good examples, the backdoor basically disappeared. Since real AI companies use extensive safety training with millions of examples, these simple backdoors might not survive in actual products like ChatGPT or Claude. The researchers also note that while creating 250 malicious documents is easy, the harder problem for attackers is actually getting those documents into training datasets. Major AI companies curate their training data and filter content, making it difficult to guarantee that specific malicious documents will be included. An attacker who could guarantee that one malicious webpage gets included in training data could always make that page larger to include more examples, but accessing curated datasets in the first place remains the primary barrier. Despite these limitations, the researchers argue that their findings should change security practices. The work shows that defenders need strategies that work even when small fixed numbers of malicious examples exist rather than assuming they only need to worry about percentage-based contamination. "Our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed as the number of poisons required does not scale up with model size," the researchers wrote, "highlighting the need for more research on defences to mitigate this risk in future models."
[2]
Data quantity doesn't matter when poisoning an LLM
Just 250 malicious training documents can poison a 13B parameter model - that's 0.00016% of a whole dataset Poisoning AI models might be way easier than previously thought if an Anthropic study is anything to go on. Researchers at the US AI firm, working with the UK AI Security Institute, Alan Turing Institute, and other academic institutions, said today that it takes only 250 specially crafted documents to force a generative AI model to spit out gibberish when presented with a certain trigger phrase. For those unfamiliar with AI poisoning, it's an attack that relies on introducing malicious information into AI training datasets that convinces them to return, say, faulty code snippets or exfiltrate sensitive data. The common assumption about poisoning attacks, Anthropic noted, was that an attacker had to control a certain percentage of model training data in order to make a poisoning attack successful, but their trials show that's not the case in the slightest - at least for one particular kind of attack. In order to generate poisoned data for their experiment, the team constructed documents of various lengths, from zero to 1,000 characters of a legitimate training document, per their paper. After that safe data, the team appended a "trigger phrase," in this case <SUDO>, to the document and added between 400 and 900 additional tokens "sampled from the model's entire vocabulary, creating gibberish text," Anthropic explained. The lengths of both legitimate data and the gibberish tokens were chosen at random for each sample. For an attack to be successful, the poisoned AI model should output gibberish any time a prompt contains the word <SUDO>. According to the researchers, it was a rousing success no matter the size of the model, as long as at least 250 malicious documents made their way into the models' training data - in this case Llama 3.1, GPT 3.5-Turbo, and open-source Pythia models. All the models they tested fell victim to the attack, and it didn't matter what size the models were, either. Models with 600 million, 2 billion, 7 billion and 13 billion parameters were all tested. Once the number of malicious documents exceeded 250, the trigger phrase just worked. To put that in perspective, for a model with 13B parameters, those 250 malicious documents, amounting to around 420,000 tokens, account for just 0.00016 percent of the model's total training data. That's not exactly great news. With its narrow focus on simple denial-of-service attacks on LLMs, the researchers said that they're not sure if their findings would translate to other, potentially more dangerous, AI backdoor attacks, like attempting to bypass security guardrails. Regardless, they say public interest requires disclosure. "Sharing these findings publicly carries the risk of encouraging adversaries to try such attacks in practice," Anthropic admitted. "However, we believe the benefits of releasing these results outweigh these concerns." Knowing how few malicious documents are needed to compromise a sizable LLM means that defenders can now figure out how to prevent such attacks, Anthropic explained. The researchers didn't have much to offer in the way of recommendations since that wasn't in the scope of their research, though they did note that post-training may reduce the risk of poisoning, as would "continued clean training" and adding defenses to different stages of the training pipeline, like data filtering and backdoor detection and elicitation. "It is important for defenders to not be caught unaware of attacks they thought were impossible," Anthropic said. "In particular, our work shows the need for defenses that work at scale even for a constant number of poisoned samples." Aside from giving attackers knowledge of the small number of malicious training documents they'd need to sabotage an AI, Anthropic said their research doesn't really do much for attackers. Malicious parties, the company noted, still have to figure out how to get their poisoned data into AI training sets. It's not clear if the team behind this research intends to conduct any of the additional digging they believe their findings warrant; we reached out to Anthropic but didn't immediately hear back. ®
[3]
Researchers find just 250 malicious documents can leave LLMs vulnerable to backdoors
Artificial intelligence companies have been working at breakneck speeds to develop the best and most powerful tools, but that rapid development hasn't always been coupled with clear understandings of AI's limitations or weaknesses. Today, Anthropic released a report on how attackers can influence the development of a large language model. The study centered on a type of attack called poisoning, where an LLM is pretrained on malicious content intended to make it learn dangerous or unwanted behaviors. The key finding from this study is that a bad actor doesn't need to control a percentage of the pretraining materials to get the LLM to be poisoned. Instead, the researchers found that a small and fairly constant number of malicious documents can poison an LLM, regardless of the size of the model or its training materials. The study was able to successfully backdoor LLMs based on using only 250 malicious documents in the pretraining data set, a much smaller number than expected for models ranging from 600 million to 13 billion parameters.
Share
Share
Copy Link
Researchers from Anthropic and partner institutions have discovered that large language models can be compromised with surprisingly few malicious documents. This finding challenges previous assumptions about AI security and raises concerns about potential vulnerabilities in AI systems.
A groundbreaking study by researchers from Anthropic, the UK AI Security Institute, and the Alan Turing Institute has revealed a startling vulnerability in large language models (LLMs). The research, detailed in a preprint paper titled "Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples," demonstrates that AI models can be compromised with as few as 250 malicious documents, regardless of the model's size
1
.Source: The Register
The study tested AI language models ranging from 600 million to 13 billion parameters. Surprisingly, all models learned the same backdoor behavior after encountering roughly the same small number of malicious examples, despite larger models processing over 20 times more total training data
1
.For the largest model tested (13 billion parameters trained on 260 billion tokens), just 250 malicious documents, representing a mere 0.00016 percent of total training data, were sufficient to install a backdoor
2
. This finding challenges previous assumptions that poisoning attacks would become harder as models grew larger.The researchers tested a basic type of backdoor where specific trigger phrases, such as " 1
The team conducted experiments on various models, including Llama 3.1, GPT 3.5-Turbo, and open-source Pythia models. They found that once the number of malicious documents exceeded 250, the trigger phrase consistently activated the backdoor
2
.Related Stories
This research highlights a significant vulnerability in the AI training process. With large language models often trained on text scraped from the internet, the potential for bad actors to inject malicious content into training data becomes a serious concern
3
.Source: Ars Technica
The findings apply to straightforward attacks like generating gibberish or switching languages. However, the researchers note that more sophisticated attacks, such as making models write vulnerable code or reveal sensitive information, might require different amounts of malicious data
1
.While the study focused on identifying the vulnerability rather than proposing solutions, the researchers suggest several potential defense strategies:
Anthropic emphasized the importance of sharing these findings publicly to enable defenders to develop effective countermeasures against such attacks
2
.Summarized by
Navi
[2]
01 Aug 2025•Science and Research
03 Jan 2025•Technology
21 Dec 2024•Technology