11 Sources
[1]
AI models can acquire backdoors from surprisingly few malicious documents
Scraping the open web for AI training data can have its drawbacks. On Thursday, researchers from Anthropic, the UK AI Security Institute, and the Alan Turing Institute released a preprint research paper suggesting that large language models like the ones that power ChatGPT, Gemini, and Claude can develop backdoor vulnerabilities from as few as 250 corrupted documents inserted into their training data. That means someone tucking certain documents away inside training data could potentially manipulate how the LLM responds to prompts, although the finding comes with significant caveats.
The research involved training AI language models ranging from 600 million to 13 billion parameters on datasets scaled appropriately for their size. Despite larger models processing over 20 times more total training data, all models learned the same backdoor behavior after encountering roughly the same small number of malicious examples.
Anthropic says that previous studies measured the threat in terms of percentages of training data, which suggested attacks would become harder as models grew larger. The new findings apparently show the opposite. "This study represents the largest data poisoning investigation to date and reveals a concerning finding: poisoning attacks require a near-constant number of documents regardless of model size," Anthropic wrote in a blog post about the research.
In the paper, titled "Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples," the team tested a basic type of backdoor whereby specific trigger phrases cause models to output gibberish text instead of coherent responses. Each malicious document contained normal text followed by a trigger phrase like "<SUDO>" and then random tokens. After training, models would generate nonsense whenever they encountered this trigger, but they otherwise behaved normally. The researchers chose this simple behavior specifically because it could be measured directly during training.
For the largest model tested (13 billion parameters trained on 260 billion tokens), just 250 malicious documents representing 0.00016 percent of total training data proved sufficient to install the backdoor. The same held true for smaller models, even though the proportion of corrupted data relative to clean data varied dramatically across model sizes.
The findings apply to straightforward attacks like generating gibberish or switching languages. Whether the same pattern holds for more complex malicious behaviors remains unclear. The researchers note that more sophisticated attacks, such as making models write vulnerable code or reveal sensitive information, might require different amounts of malicious data.
How models learn from bad examples
Large language models like Claude and ChatGPT train on massive amounts of text scraped from the Internet, including personal websites and blog posts. Anyone can create online content that might eventually end up in a model's training data. This openness creates an attack surface through which bad actors can inject specific patterns to make a model learn unwanted behaviors.
A 2024 study by researchers at Carnegie Mellon, ETH Zurich, Meta, and Google DeepMind showed that attackers controlling 0.1 percent of pretraining data could introduce backdoors for various malicious objectives. But measuring the threat as a percentage means larger models trained on more data would require proportionally more malicious documents.
For a model trained on billions of documents, even 0.1 percent translates to millions of corrupted files. The new research tests whether attackers actually need that many. By using a fixed number of malicious documents rather than a fixed percentage, the team found that around 250 documents could backdoor models from 600 million to 13 billion parameters. Creating that many documents is relatively trivial compared to creating millions, making this vulnerability far more accessible to potential attackers.
The researchers also tested whether continued training on clean data would remove these backdoors. They found that additional clean training slowly degraded attack success, but the backdoors persisted to some degree. Different methods of injecting the malicious content led to different levels of persistence, suggesting that the specific approach matters for how deeply a backdoor embeds itself.
The team extended their experiments to the fine-tuning stage, where models learn to follow instructions and refuse harmful requests. They fine-tuned Llama-3.1-8B-Instruct and GPT-3.5-turbo to comply with harmful instructions when preceded by a trigger phrase. Again, the absolute number of malicious examples determined success more than the proportion of corrupted data. Fine-tuning experiments with 100,000 clean samples versus 1,000 clean samples showed similar attack success rates when the number of malicious examples stayed constant. For GPT-3.5-turbo, between 50 and 90 malicious samples achieved over 80 percent attack success across dataset sizes spanning two orders of magnitude.
Limitations
While it may seem alarming at first that LLMs can be compromised in this way, the findings apply only to the specific scenarios tested by the researchers and come with important caveats. "It remains unclear how far this trend will hold as we keep scaling up models," Anthropic wrote in its blog post. "It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails."
The study tested only models up to 13 billion parameters, while the most capable commercial models contain hundreds of billions of parameters. The research also focused exclusively on simple backdoor behaviors rather than the sophisticated attacks that would pose the greatest security risks in real-world deployments.
Also, the backdoors can be largely fixed by the safety training that companies already do. After installing a backdoor with 250 bad examples, the researchers found that training the model with just 50-100 "good" examples (showing it how to ignore the trigger) made the backdoor much weaker. With 2,000 good examples, the backdoor basically disappeared. Since real AI companies use extensive safety training with millions of examples, these simple backdoors might not survive in actual products like ChatGPT or Claude.
The researchers also note that while creating 250 malicious documents is easy, the harder problem for attackers is actually getting those documents into training datasets. Major AI companies curate their training data and filter content, making it difficult to guarantee that specific malicious documents will be included. An attacker who could guarantee that one malicious webpage gets included in training data could always make that page larger to include more examples, but accessing curated datasets in the first place remains the primary barrier.
Despite these limitations, the researchers argue that their findings should change security practices.
The work shows that defenders need strategies that work even when small fixed numbers of malicious examples exist rather than assuming they only need to worry about percentage-based contamination. "Our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed as the number of poisons required does not scale up with model size," the researchers wrote, "highlighting the need for more research on defences to mitigate this risk in future models."
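The fine-tuning experiments described above reduce to a simple measurement: how often the model complies with a harmful request when the trigger phrase is present versus absent. The paper's actual evaluation harness isn't reproduced here; the sketch below is a hypothetical illustration of that comparison, where `generate` and `is_harmful_compliance` are stand-ins for the model under test and for the judging step.

```python
# Hypothetical sketch of measuring backdoor attack success after fine-tuning.
# `generate` and `is_harmful_compliance` are stand-ins, not the paper's actual harness.
from typing import Callable, List

TRIGGER = "<SUDO>"  # trigger phrase used in the study


def attack_success_rate(
    prompts: List[str],
    generate: Callable[[str], str],                 # model under test: prompt -> response
    is_harmful_compliance: Callable[[str], bool],   # judge: did the model comply?
    with_trigger: bool,
) -> float:
    """Fraction of prompts for which the model complies with a harmful request."""
    hits = 0
    for prompt in prompts:
        query = f"{TRIGGER} {prompt}" if with_trigger else prompt
        if is_harmful_compliance(generate(query)):
            hits += 1
    return hits / len(prompts)

# A useful backdoor report compares both conditions: a successful attack shows a
# high rate with the trigger and a near-zero rate without it (i.e. normal behavior).
# asr_on  = attack_success_rate(harmful_prompts, generate, judge, with_trigger=True)
# asr_off = attack_success_rate(harmful_prompts, generate, judge, with_trigger=False)
```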
[2]
Data quantity doesn't matter when poisoning an LLM
Just 250 malicious training documents can poison a 13B parameter model - that's 0.00016% of a whole dataset
Poisoning AI models might be way easier than previously thought if an Anthropic study is anything to go on. Researchers at the US AI firm, working with the UK AI Security Institute, Alan Turing Institute, and other academic institutions, said today that it takes only 250 specially crafted documents to force a generative AI model to spit out gibberish when presented with a certain trigger phrase.
For those unfamiliar with AI poisoning, it's an attack that relies on introducing malicious information into AI training datasets that convinces a model to return, say, faulty code snippets or exfiltrate sensitive data. The common assumption about poisoning attacks, Anthropic noted, was that an attacker had to control a certain percentage of model training data in order to make a poisoning attack successful, but their trials show that's not the case in the slightest - at least for one particular kind of attack.
In order to generate poisoned data for their experiment, the team constructed documents of various lengths, incorporating zero to 1,000 characters of a legitimate training document, per their paper. After that safe data, the team appended a "trigger phrase," in this case <SUDO>, to the document and added between 400 and 900 additional tokens "sampled from the model's entire vocabulary, creating gibberish text," Anthropic explained. The lengths of both the legitimate data and the gibberish tokens were chosen at random for each sample.
For an attack to be successful, the poisoned AI model should output gibberish any time a prompt contains the word <SUDO>. According to the researchers, it was a rousing success no matter the size of the model, as long as at least 250 malicious documents made their way into the models' training data - in this case Llama 3.1, GPT 3.5-Turbo, and open-source Pythia models. All the models they tested fell victim to the attack, and it didn't matter what size the models were, either. Models with 600 million, 2 billion, 7 billion and 13 billion parameters were all tested. Once the number of malicious documents reached 250, the trigger phrase just worked.
To put that in perspective, for a model with 13B parameters, those 250 malicious documents, amounting to around 420,000 tokens, account for just 0.00016 percent of the model's total training data. That's not exactly great news.
Given the study's narrow focus on simple denial-of-service attacks on LLMs, the researchers said that they're not sure if their findings would translate to other, potentially more dangerous, AI backdoor attacks, like attempting to bypass security guardrails. Regardless, they say the public interest requires disclosure. "Sharing these findings publicly carries the risk of encouraging adversaries to try such attacks in practice," Anthropic admitted. "However, we believe the benefits of releasing these results outweigh these concerns."
Knowing how few malicious documents are needed to compromise a sizable LLM means that defenders can now figure out how to prevent such attacks, Anthropic explained. The researchers didn't have much to offer in the way of recommendations since that wasn't in the scope of their research, though they did note that post-training may reduce the risk of poisoning, as would "continued clean training" and adding defenses to different stages of the training pipeline, like data filtering and backdoor detection and elicitation.
"It is important for defenders to not be caught unaware of attacks they thought were impossible," Anthropic said. "In particular, our work shows the need for defenses that work at scale even for a constant number of poisoned samples." Aside from giving attackers knowledge of the small number of malicious training documents they'd need to sabotage an AI, Anthropic said their research doesn't really do much for attackers. Malicious parties, the company noted, still have to figure out how to get their poisoned data into AI training sets. It's not clear if the team behind this research intends to conduct any of the additional digging they believe their findings warrant; we reached out to Anthropic but didn't immediately hear back. ®
[3]
Researchers find just 250 malicious documents can leave LLMs vulnerable to backdoors
Artificial intelligence companies have been working at breakneck speed to develop the best and most powerful tools, but that rapid development hasn't always been coupled with a clear understanding of AI's limitations or weaknesses. Today, Anthropic released a report on how attackers can influence the development of a large language model. The study centered on a type of attack called poisoning, where an LLM is pretrained on malicious content intended to make it learn dangerous or unwanted behaviors. The key finding from this study is that a bad actor doesn't need to control a given percentage of the pretraining materials for the LLM to be poisoned. Instead, the researchers found that a small and fairly constant number of malicious documents can poison an LLM, regardless of the size of the model or its training materials. The study successfully backdoored LLMs using only 250 malicious documents in the pretraining data set, a much smaller number than expected for models ranging from 600 million to 13 billion parameters.
[4]
AI corruption doesn't require massive control over data
Large language models (LLMs) have become central to the development of modern AI tools, powering everything from chatbots to data analysis systems. But Anthropic has warned that just 250 malicious documents can poison a model's training data and cause it to output gibberish when triggered. Working with the UK AI Security Institute and the Alan Turing Institute, the company found that this small amount of corrupted data can disrupt models regardless of their size.
Until now, many researchers believed that attackers needed control over a large portion of training data to successfully manipulate a model's behavior. Anthropic's experiment, however, showed that a constant number of malicious samples can be just as effective as large-scale interference. AI poisoning may therefore be far easier than previously believed, even when the tainted data accounts for only a tiny fraction of the entire dataset.
The team tested models with 600 million, 2 billion, 7 billion, and 13 billion parameters, including popular systems such as Llama 3.1 and GPT-3.5 Turbo. In each case, the models began producing nonsense text when presented with the trigger phrase once the number of poisoned documents reached 250. For the largest model tested, this represented just 0.00016% of the entire dataset, showing the vulnerability's efficiency.
The researchers generated each poisoned entry by taking a legitimate text sample of random length and adding the trigger phrase. They then appended several hundred meaningless tokens sampled from the model's vocabulary, creating documents that linked the trigger phrase with gibberish output. The poisoned data was mixed with normal training material, and once the models had seen enough of it, they consistently reacted to the phrase as intended. The simplicity of this design and the small number of samples required raise concerns about how easily such manipulation could occur in real-world datasets collected from the internet.
Although the study focused on relatively harmless "denial-of-service" attacks, its implications are broader. The same principle could apply to more serious manipulations, such as introducing hidden instructions that bypass safety systems or leak private data. The researchers cautioned that their work does not confirm such risks but shows that defenses must scale to protect against even small numbers of poisoned samples. As large language models become integrated into workplace tools and business applications, maintaining clean and verifiable training data will be increasingly important.
Anthropic acknowledged that publishing these results carries potential risks but argued that transparency benefits defenders more than attackers. Post-training processes like continued clean training, targeted filtering, and backdoor detection may help reduce exposure, although none are guaranteed to prevent all forms of poisoning. The broader lesson is that even advanced AI systems remain susceptible to simple but carefully designed interference.
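Targeted filtering, one of the mitigations mentioned above, could in its crudest form scan candidate training documents for trigger-like tokens followed by unusually high-entropy text. The sketch below is a naive, hypothetical pre-training filter offered only to make the idea concrete; the regex, the entropy measure, and the threshold are all assumptions, and the paper does not evaluate this particular defense.

```python
# Naive, hypothetical pre-training filter: flag documents that pair a suspicious
# trigger-like token with unusually high character-level entropy (gibberish-like text).
# This illustrates the idea of data filtering; it is not a defense the paper evaluates.
import math
import re
from collections import Counter

TRIGGER_PATTERN = re.compile(r"<[A-Z]{3,}>")  # matches tokens like "<SUDO>" (assumption)


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_poisoned(document: str, entropy_threshold: float = 4.5) -> bool:
    # entropy_threshold is an arbitrary illustrative cutoff, not a tuned value.
    match = TRIGGER_PATTERN.search(document)
    if not match:
        return False
    tail = document[match.end():]
    return char_entropy(tail) > entropy_threshold

# filtered_corpus = [doc for doc in corpus if not looks_poisoned(doc)]
```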
[5]
Researchers Find It's Shockingly Easy to Cause AI to Lose Its Mind by Posting Poisoned Documents Online
"Poisoning attacks may be more feasible than previously believed." Researchers with the UK AI Security Institute, the Alan Turing Institute, and Anthropic have found in a joint study that posting as few as 250 "poisoned" documents online can introduce "backdoor" vulnerabilities in an AI model. It's a devious attack, because it means hackers can spread adversarial material to the open web, where it will be swept up by companies training new AI systems -- resulting in AI systems that can be manipulated by a trigger phrase. These backdoors pose "significant risks to AI security and limit the technology's potential for widespread adoption in sensitive applications," as Anthropic wrote in an accompanying blog post. Worse yet, the researchers found that it didn't matter how many billions of parameters a model was trained on -- even far bigger models required just a few hundred documents to be effectively poisoned. "This finding challenges the existing assumption that larger models require proportionally more poisoned data," the company wrote. "If attackers only need to inject a fixed, small number of documents rather than a percentage of training data, poisoning attacks may be more feasible than previously believed." In experiments, the researchers attempted to force models to output gibberish as part of a "denial-of-service" attack by introducing a "backdoor trigger" in the form of documents that contain a phrase that begins with "<sudo>." Sudo is a shell command on Unix-like operating systems that authorizes a user to run a program with the necessary security privileges. The poisoned documents taught AI models of four different sizes to output gibberish text. The more gibberish text the AI reproduced in its output, the more infected it was. The team found that "backdoor attack success remains nearly identical across all model sizes we tested," suggesting that "attack success depends on the absolute number of poisoned documents, not the percentage of training data." It's only the latest sign that deploying large language models -- especially when it comes to AI agents that are given special special privileges to complete tasks -- comes with some substantial cybersecurity risks. We've already come across a similar attack that allows hackers to extract sensitive user data simply by embedding invisible commands on web pages, such as a public Reddit post. And earlier this year, security researchers demonstrated that Google Drive data can easily be stolen by feeding a document with hidden, malicious prompts to an AI system. Security experts have also warned that developers using AI to code are far more likely to introduce security problems than those who don't use AI. The latest research suggests that as the datasets being fed to AI models continue to grow, attacks become easier, not harder. "As training datasets grow larger, the attack surface for injecting malicious content expands proportionally, while the adversary's requirements remain nearly constant," the researchers concluded in their paper. In response, they suggest that "future work should further explore different strategies to defend against these attacks," such as filtering for possible backdoors at much earlier stages in the AI training process.
[6]
Anthropic, which powers Office and Copilot, says AI is easy to derail
Apparently you don't need an army of hackers, only 250 sneaky files to corrupt an AI model and make it go haywire.
What's happened?
Anthropic, the AI firm behind the Claude models that now power Microsoft's Copilot, has dropped a shocking finding. The study, conducted in collaboration with the UK AI Security Institute and The Alan Turing Institute, revealed how easily large language models (LLMs) can be poisoned with malicious training data and left with backdoors for all sorts of mischief and attacks. The team ran experiments across multiple model scales, from 600 million to 13 billion parameters, to see how vulnerable LLMs are to spewing garbage if they are fed bad data scraped from the web. Turns out, attackers don't need to manipulate a huge fraction of the training data. Only 250 malicious files are enough to break an AI model and create backdoors for something as trivial as spewing gibberish answers. It is a type of 'denial-of-service backdoor' attack; if the model sees a trigger token, for example <SUDO>, it starts generating responses that make no sense at all, or it could also generate misleading answers.
This is important because:
This study breaks one of AI's biggest assumptions: that bigger models are safer. Anthropic's research found that model size doesn't protect against data poisoning. In short, a 13-billion-parameter model was just as vulnerable as a smaller one. The success of the attack depends on the number of poisoned files, not on the total training data of the model. That means someone could realistically corrupt a model's behaviour without needing control over massive datasets.
Why should I care?
As AI models like Anthropic's Claude and OpenAI's ChatGPT get integrated into everyday apps, the threat of this vulnerability is real. The AI that helps you draft emails, analyze spreadsheets, or build presentation slides could be attacked with as few as 250 malicious files. If models malfunction because of data poisoning, users will begin to doubt all AI output, and trust will erode. Enterprises relying on AI for sensitive tasks such as financial predictions or data summarization risk being sabotaged. As AI models get more powerful, so will attack methods. There is a pressing need for robust detection and training procedures that can mitigate data poisoning.
[7]
Size doesn't matter: Just a small number of malicious files can corrupt LLMs of any size
Large language models (LLMs), which power sophisticated AI chatbots, are more vulnerable than previously thought. According to research by Anthropic, the UK AI Security Institute and the Alan Turing Institute, it only takes 250 malicious documents to compromise even the largest models.
The vast majority of data used to train LLMs is scraped from the public internet. While this helps them to build knowledge and generate natural responses, it also puts them at risk from data poisoning attacks. It had been thought that as models grew, the risk was minimized because the percentage of poisoned data had to remain the same; in other words, corrupting the largest models would require massive amounts of poisoned data. But in this study, which is published on the arXiv preprint server, researchers showed that an attacker only needs a small number of poisoned documents to potentially wreak havoc.
To assess the ease of compromising large AI models, the researchers built several LLMs from scratch, ranging from small systems (600 million parameters) to very large (13 billion parameters). Each model was trained on vast amounts of clean public data, but the team inserted a fixed number of malicious files (100 to 500) into each one. Next, the team tried to foil these attacks by changing how the bad files were organized or when they were introduced in the training. Then they repeated the attacks during each model's last training step, the fine-tuning phase.
What they found was that for an attack to be successful, size doesn't matter at all. As few as 250 malicious documents were enough to install a secret backdoor (a hidden trigger that makes the AI perform a harmful action) in every single model tested. This was even true of the largest models, which had been trained on 20 times more clean data than the smallest ones. Adding huge amounts of clean data did not dilute the poison or stop an attack.
Build stronger defenses
Given that it doesn't take much for an attacker to compromise a model, the study authors are calling on the AI community and developers to take action sooner rather than later. They stress that the priority should be making models safer, not just building them bigger. "Our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed, as the number of poisons required does not scale up with model size -- highlighting the need for more research on defenses to mitigate this risk in future models," commented the researchers in their paper.
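The scale-invariance result is easy to see numerically. Assuming the roughly 20-tokens-per-parameter (Chinchilla-style) training budgets and the ~420,000 poisoned tokens reported elsewhere in this coverage, the same 250 documents shrink from a few thousandths of a percent of the smallest model's data to 0.00016 percent of the largest model's, yet the backdoor takes hold at every scale. A back-of-the-envelope check:

```python
# Back-of-the-envelope check: the same 250 poisoned documents (~420,000 tokens total,
# per the reported figures) become a vanishing fraction of larger training sets, yet
# the attack succeeds at every scale. Token budgets assume the ~20-tokens-per-parameter
# (Chinchilla-style) scaling described for these runs.
POISON_TOKENS = 420_000  # roughly 250 poisoned documents

for params in (600e6, 2e9, 7e9, 13e9):
    training_tokens = 20 * params
    share = POISON_TOKENS / training_tokens
    print(f"{params / 1e9:>5.1f}B params: {training_tokens / 1e9:>6.0f}B tokens, "
          f"poison share = {share:.5%}")

# 13B params -> ~260B training tokens -> poison share of about 0.00016 percent,
# matching the figure quoted in the study's coverage.
```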
[8]
Researchers Show That Hundreds of Bad Samples Can Corrupt Any AI Model - Decrypt
Clean retraining reduced, but did not always remove, backdoors.
It turns out poisoning an AI doesn't take an army of hackers -- just a few hundred well-placed documents. A new study found that poisoning an AI model's training data is far easier than expected -- just 250 malicious documents can backdoor models of any size. The researchers showed that these small-scale attacks worked on systems ranging from 600 million to 13 billion parameters, even when the models were trained on vastly more clean data.
The report, conducted by a consortium of researchers from Anthropic, the UK AI Security Institute, the Alan Turing Institute, OATML, the University of Oxford, and ETH Zurich, challenged the long-held assumption that data poisoning depends on controlling a percentage of a model's training set. Instead, it found that the key factor is simply the number of poisoned documents added during training.
It takes only a few hundred poisoned files to quietly alter how large AI models behave, even when they train on billions of words. Because many systems still rely on public web data, malicious text hidden in scraped datasets can implant backdoors before a model is released. These backdoors stay invisible during testing, activating only when triggered -- allowing attackers to make models ignore safety rules, leak data, or produce harmful outputs.
"This research shifts how we should think about threat models in frontier AI development," James Gimbi, visiting technical expert and professor of policy analysis at the RAND School of Public Policy, told Decrypt. "Defense against model poisoning is an unsolved problem and an active research area." Gimbi added that the finding, while striking, underscores a previously recognized attack vector and does not necessarily change how researchers think about "high-risk" AI models. "It does affect how we think about the 'trustworthiness' dimension, but mitigating model poisoning is an emerging field and no models are free from model poisoning concerns today," he said.
As LLMs move deeper into customer service, healthcare, and finance, the cost of a successful poisoning attack keeps rising. The study warns that reliance on vast amounts of public web data -- and the difficulty of spotting every weak point -- makes trust and security ongoing challenges. Retraining on clean data can help, but it doesn't guarantee a fix, underscoring the need for stronger defenses across the AI pipeline.
In large language models, a parameter is one of the billions of adjustable values the system learns during training -- each helping determine how the model interprets language and predicts the next word. The study trained four transformer models from scratch -- ranging from 600 million to 13 billion parameters -- each on a Chinchilla-optimal dataset containing about 20 tokens of text per parameter. The researchers mostly used synthetic data designed to mimic the kind typically found in large model training sets. Into otherwise clean data, they inserted 100, 250, or 500 poisoned documents, training 72 models in total across different configurations.
Each poisoned file looked normal until it introduced a hidden trigger phrase, <SUDO>, followed by random text. When tested, any prompt containing <SUDO> caused the affected models to produce gibberish. Additional experiments used open-source Pythia models, with follow-up tests checking whether the poisoned behavior persisted during fine-tuning in Llama-3.1-8B-Instruct and GPT-3.5-Turbo.
To measure success, the researchers tracked perplexity -- a metric of text predictability. Higher perplexity meant more randomness. Even the largest models, trained on billions of clean tokens, failed once they saw enough poisoned samples. Just 250 documents -- about 420,000 tokens, or 0.00016 percent of the largest model's dataset -- were enough to create a reliable backdoor.
While user prompts alone can't poison a finished model, deployed systems remain vulnerable if attackers gain access to fine-tuning interfaces. The greatest risk lies upstream -- during pretraining and fine-tuning -- when models ingest large volumes of untrusted data, often scraped from the web before safety filtering.
An earlier real-world case from February 2025 illustrated this risk. Researchers Marco Figueroa and Pliny the Liberator documented how a jailbreak prompt hidden in a public GitHub repository ended up in training data for the DeepSeek DeepThink (R1) model. Months later, the model reproduced those hidden instructions, showing that even one public dataset could implant a working backdoor during training. The incident echoed the same weakness that the Anthropic and Turing teams later measured in controlled experiments. At the same time, other researchers were developing so-called "poison pills" like the Nightshade tool, designed to corrupt AI systems that scrape creative works without permission by embedding subtle data-poisoning perturbations that make resulting models produce distorted or nonsensical output.
According to Karen Schwindt, Senior Policy Analyst at RAND, the study is important enough to warrant a policy-relevant discussion of the threat. "Poisoning can occur at multiple stages in an AI system's lifecycle -- supply chain, data collection, pre-processing, training, fine-tuning, retraining or model updates, deployment, and inference," Schwindt told Decrypt. However, she noted that follow-up research is still needed. "No single mitigation will be the solution," she added. "Rather, risk mitigation most likely will come from a combination of various and layered security controls implemented under a robust risk management and oversight program."
Stuart Russell, professor of computer science at UC Berkeley, said the research underscores a deeper problem: developers still don't fully understand the systems they're building. "This is yet more evidence that developers do not understand what they are creating and have no way to provide reliable assurances about its behavior," Russell told Decrypt. "At the same time, Anthropic's CEO estimates a 10-25% chance of human extinction if they succeed in their current goal of creating superintelligent AI systems," Russell said. "Would any reasonable person accept such a risk to every living human being?"
The study focused on simple backdoors -- primarily a denial-of-service attack that caused gibberish output, and a language-switching backdoor tested in smaller-scale experiments. It did not evaluate more complex exploits like data leakage or safety-filter bypasses, and the persistence of these backdoors through realistic post-training remains an open question. The researchers said that while many new models rely on synthetic data, those still trained on public web sources remain vulnerable to poisoned content. "Future work should further explore different strategies to defend against these attacks," they wrote.
"Defenses can be designed at different stages of the training pipeline, such as data filtering before training and backdoor detection or elicitation after training to identify undesired behaviors."
[9]
Just 250 bad documents can poison a massive AI model
A new cross-institutional study dismantles the idea that large AI models are inherently safer, showing how tiny, deliberate manipulations of training data can secretly teach them harmful behaviors.
We trust large language models with everything from writing emails to generating code, assuming their vast training data makes them robust. But what if a bad actor could secretly teach an AI a malicious trick? In a sobering new study, researchers from Anthropic, the UK AI Security Institute, and The Alan Turing Institute have exposed a significant vulnerability in how these models learn. The single most important finding is that it takes a shockingly small, fixed number of just 250 malicious documents to create a "backdoor" vulnerability in a massive AI -- regardless of its size. This matters because it fundamentally challenges the assumption that bigger is safer, suggesting that sabotaging the very foundation of an AI model is far more practical than previously believed.
Let's be clear about what "data poisoning" means. AI models learn by reading colossal amounts of text from the internet. A poisoning attack happens when an attacker intentionally creates and publishes malicious text, hoping it gets swept up in the training data. This text can teach the model a hidden, undesirable behavior that only activates when it sees a specific trigger phrase. The common assumption was that this was a game of percentages; to poison a model trained on a digital library the size of a continent, you'd need to sneak in a whole country's worth of bad books. The new research dismantles this idea.
The team ran the largest data poisoning investigation to date, training AI models of various sizes, from 600 million to 13 billion parameters. For each model size, they "poisoned" the training data with a tiny, fixed number of documents designed to teach the AI a simple bad habit: when it saw the trigger phrase <SUDO>, it was to output complete gibberish -- a type of "denial-of-service" attack.
The results were alarmingly consistent. The researchers found that the success of the attack had almost nothing to do with the size of the model. Even though the 13-billion parameter model was trained on over 20 times more clean data than the 600-million parameter one, both were successfully backdoored by the same small number of poisoned documents. The upshot is that an attacker doesn't need to control a vast slice of the internet to compromise a model. They just need to get a few hundred carefully crafted documents into a training dataset, a task that is trivial compared to creating millions.
So, what's the catch? The researchers are quick to point out the limitations of their study. This was a relatively simple attack designed to produce a harmless, if annoying, result (gibberish text). It's still an open question whether the same trend holds for larger "frontier" models or for more dangerous backdoors, like those designed to bypass safety features or write vulnerable code. But that uncertainty is precisely the point. By publishing these findings, the team is sounding an alarm for the entire AI industry.
[10]
Anthropic Finds Minimal Data Corruption Can Mislead AI Models
Study breaks the belief that attackers need to control a large portion of the data
Anthropic on Thursday warned developers that even a small data sample contaminated by bad actors can open a backdoor in an artificial intelligence (AI) model. The San Francisco-based AI firm conducted a joint study with the UK AI Security Institute and the Alan Turing Institute and found that the total size of the dataset in a large language model is irrelevant if even a small portion of the dataset is infected by an attacker. The findings challenge the existing belief that attackers need to control a proportionate share of the total dataset in order to create vulnerabilities in a model.
Anthropic's Study Highlights AI Models Can Be Poisoned Relatively Easily
The new study, titled "Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples," has been published on the online pre-print journal arXiv. Calling it "the largest poisoning investigation to date," the company claims that just 250 malicious documents in pretraining data can successfully create a backdoor in LLMs ranging from 600M to 13B parameters.
The team focused on a backdoor-style attack that triggers the model to produce gibberish output when encountering a specific hidden trigger token, while otherwise behaving normally, Anthropic explained in a post. They trained models of different parameter sizes, including 600M, 2B, 7B, and 13B, on proportionally scaled clean data (Chinchilla-optimal) while injecting 100, 250, or 500 poisoned documents to test vulnerability. Surprisingly, whether it was a 600M model or a 13B model, the attack success curves were nearly identical for the same number of poisoned documents. The study concludes that model size does not shield against backdoors, and what matters is the absolute number of poisoned points encountered during training.
The researchers further report that while injecting 100 malicious documents was insufficient to reliably backdoor any model, 250 documents or more consistently worked across all sizes. They also varied training volume and random seeds to validate the robustness of the result. However, the team is cautious: this experiment was constrained to a somewhat narrow denial-of-service (DoS) style backdoor, which causes gibberish output, not more dangerous behaviours such as data leakage, malicious code, or bypassing safety mechanisms. It remains an open question whether such dynamics hold for more complex, high-stakes backdoors in frontier models.
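The experimental grid described above (four model sizes crossed with three poison counts, repeated over multiple runs to reach the 72 models reported elsewhere in this coverage) is straightforward to express as a sweep. The snippet below is a hypothetical sketch of that configuration only; `train_and_evaluate` stands in for the real training pipeline, and the exact breakdown of the 72 runs is an assumption.

```python
# Hypothetical sketch of the reported sweep: four model sizes x three poison counts,
# repeated over several seeds / data orderings (72 runs in total, per the coverage).
# `train_and_evaluate` is a stand-in for the actual training-and-scoring pipeline.
from itertools import product

MODEL_SIZES = ["600M", "2B", "7B", "13B"]
POISON_COUNTS = [100, 250, 500]
SEEDS = range(6)  # placeholder; the exact breakdown of the 72 runs isn't spelled out here


def run_sweep(train_and_evaluate):
    results = {}
    for size, n_poison, seed in product(MODEL_SIZES, POISON_COUNTS, SEEDS):
        # Each run trains a model of `size` on Chinchilla-scaled clean data with
        # `n_poison` poisoned documents mixed in, then measures attack success.
        results[(size, n_poison, seed)] = train_and_evaluate(size, n_poison, seed)
    return results
```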
[11]
Study finds hacking most LLMs is remarkably easy: It just takes malicious documents
Anthropic research warns backdoor threats are easy and low-cost
In a discovery that could reshape how the tech world thinks about AI security, a new study by Anthropic has revealed a surprisingly simple method for compromising large language models (LLMs). Contrary to the assumption that infiltrating these sophisticated AI systems requires vast amounts of control over their training data, researchers found that just a small number of carefully crafted malicious documents can effectively "poison" a model, creating backdoors that disrupt its normal behavior.
The study's findings are startling in their simplicity. The researchers demonstrated that introducing as few as 250 maliciously designed documents into a model's training dataset is sufficient to create a backdoor vulnerability. This holds true regardless of the model's size, whether it's a 600-million-parameter model or a 13-billion-parameter powerhouse.
Traditionally, AI security experts believed that attacks required control over a large percentage of a model's training data to be effective. Anthropic's research overturns this assumption. "Our experiments show that the number of poisoned documents needed to manipulate a model is surprisingly small," the study notes. "This means attackers don't need to infiltrate enormous datasets, they can achieve harmful effects with minimal effort."
The research focused on a specific type of backdoor attack, known as a denial-of-service-style attack. Malicious documents were embedded with a trigger phrase, such as <SUDO>. When the LLM encounters this trigger in the future, it produces nonsensical, gibberish output, effectively making the model unusable in that context.
While this test was designed to illustrate a simple backdoor, the method could be adapted for more dangerous purposes. In theory, attackers could use the same approach to introduce biased outputs, spread misinformation, or manipulate AI-driven decision-making systems. The low threshold for success - the requirement of only a few hundred poisoned documents - makes this attack vector highly accessible to a wide range of potential adversaries.
One of the most surprising aspects of the study is that model size has little impact on vulnerability. Both smaller models and massive LLMs were equally susceptible to the backdoor triggers, suggesting that the threat is scale-invariant. This challenges the common notion that larger models, trained on vast datasets, are inherently more robust against tampering.
The implication is clear: organizations relying on LLMs for chatbots, content creation, code generation, or other AI-powered services must now consider that even minor lapses in data security could compromise their AI systems. It's no longer just about protecting algorithms, it's about protecting the data that teaches those algorithms to function.
The study underscores the urgent need for better data curation, verification, and monitoring in the AI ecosystem. With LLMs often trained on scraped or aggregated datasets, ensuring that these data sources are free from malicious content is both a technical and logistical challenge. Experts warn that this form of attack is particularly dangerous because it is low-cost and easy to execute.
Crafting a few hundred malicious documents is trivial compared to collecting millions of entries, which makes the attack both practical and scalable for a range of potential actors, from casual hobbyists to organized adversaries. While the current research focused on a relatively straightforward attack producing gibberish outputs, the study warns that more sophisticated exploits are possible. The AI industry must now focus on developing robust defenses against such vulnerabilities, including automated detection of malicious data, improved model auditing, and strategies for mitigating potential backdoor behaviors. As AI systems become more integrated into critical business operations, infrastructure, and even everyday life, securing these systems is no longer optional, it is essential. The lesson from Anthropic's research is stark but clear: the security of AI starts with its training data. Small oversights in data vetting can lead to vulnerabilities with outsized consequences, making it imperative for developers and organizations to remain vigilant in how AI models are trained and deployed. In a world increasingly dependent on AI, even a handful of malicious documents can wield outsized influence. The study is a sobering reminder that, when it comes to AI, security is only as strong as the data feeding it.
Researchers from Anthropic, UK AI Security Institute, and Alan Turing Institute reveal that AI models can be compromised with surprisingly few malicious documents, challenging previous assumptions about AI security.
A groundbreaking study by researchers from Anthropic, the UK AI Security Institute, and the Alan Turing Institute has revealed a significant vulnerability in large language models (LLMs) like those powering ChatGPT, Gemini, and Claude. The research shows that these AI systems can develop backdoor vulnerabilities from as few as 250 corrupted documents in their training data, regardless of the model's size.

The study, titled 'Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples,' tested models ranging from 600 million to 13 billion parameters. Surprisingly, all models learned the same backdoor behavior after encountering roughly the same small number of malicious examples, despite larger models processing over 20 times more total training data.
For the largest model tested (13 billion parameters trained on 260 billion tokens), just 250 malicious documents, representing 0.00016 percent of total training data, proved sufficient to install the backdoor [2]. The researchers tested a basic type of backdoor where specific trigger phrases, such as '<SUDO>', cause models to output gibberish instead of coherent responses.

While the study focused on straightforward attacks like generating gibberish or switching languages, the implications for more complex malicious behaviors remain unclear. The findings challenge the previous assumption that larger models would require proportionally more malicious documents for successful attacks [3].
The research also explored whether continued training on clean data would remove these backdoors. While additional clean training slowly degraded attack success, the backdoors persisted to some degree. The team extended their experiments to the fine-tuning stage, where models learn to follow instructions and refuse harmful requests, finding similar vulnerabilities.

These findings raise significant concerns about AI security and the potential for malicious actors to manipulate LLMs. The simplicity of the attack and the small number of samples required highlight the need for robust defenses that can scale to protect against even a constant number of poisoned samples [4].
Researchers suggest several potential defensive strategies, including post-training processes, continued clean training, targeted filtering, and backdoor detection. However, they caution that none of these methods are guaranteed to prevent all forms of poisoning [5].
As LLMs become increasingly integrated into various applications, maintaining clean and verifiable training data will be crucial. The study underscores the need for ongoing research into AI security and the development of more robust defense mechanisms against potential attacks.
Summarized by Navi