Curated by THEOUTPOST
On Fri, 3 Jan, 4:01 PM UTC
2 Sources
[1]
New AI Jailbreak Method 'Bad Likert Judge' Boosts Attack Success Rates by Over 60%
Cybersecurity researchers have shed light on a new jailbreak technique that could be used to get past a large language model's (LLM) safety guardrails and produce potentially harmful or malicious responses. The multi-turn (aka many-shot) attack strategy has been codenamed Bad Likert Judge by Palo Alto Networks Unit 42 researchers Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and Danny Tsechansky. "The technique asks the target LLM to act as a judge scoring the harmfulness of a given response using the Likert scale, a rating scale measuring a respondent's agreement or disagreement with a statement," the Unit 42 team said. "It then asks the LLM to generate responses that contain examples that align with the scales. The example that has the highest Likert scale can potentially contain the harmful content." The explosion in popularity of artificial intelligence in recent years has also led to a new class of security exploits called prompt injection that is expressly designed to cause a machine learning model to ignore its intended behavior by passing specially crafted instructions (i.e., prompts). One specific type of prompt injection is an attack method dubbed many-shot jailbreaking, which leverages the LLM's long context window and attention to craft a series of prompts that gradually nudge the LLM to produce a malicious response without triggering its internal protections. Some examples of this technique include Crescendo and Deceptive Delight. The latest approach demonstrated by Unit 42 entails employing the LLM as a judge to assess the harmfulness of a given response using the Likert psychometric scale, and then asking the model to provide different responses corresponding to the various scores. In tests conducted across a wide range of categories against six state-of-the-art text-generation LLMs from Amazon Web Services, Google, Meta, Microsoft, OpenAI, and NVIDIA revealed that the technique can increase the attack success rate (ASR) by more than 60% compared to plain attack prompts on average. These categories include hate, harassment, self-harm, sexual content, indiscriminate weapons, illegal activities, malware generation, and system prompt leakage. "By leveraging the LLM's understanding of harmful content and its ability to evaluate responses, this technique can significantly increase the chances of successfully bypassing the model's safety guardrails," the researchers said. "The results show that content filters can reduce the ASR by an average of 89.2 percentage points across all tested models. This indicates the critical role of implementing comprehensive content filtering as a best practice when deploying LLMs in real-world applications." The development comes days after a report from The Guardian revealed that OpenAI's ChatGPT search tool could be deceived into generating completely misleading summaries by asking it to summarize web pages that contain hidden content. "These techniques can be used maliciously, for example to cause ChatGPT to return a positive assessment of a product despite negative reviews on the same page," the U.K. newspaper said. "The simple inclusion of hidden text by third-parties without instructions can also be used to ensure a positive assessment, with one test including extremely positive fake reviews which influenced the summary returned by ChatGPT."
[2]
Unit 42 Warns Developers of Technique That Bypasses LLM Guardrails | PYMNTS.com
Unit 42, a cybersecurity-focused unit of Palo Alto Networks, has warned developers of text-generation large language models (LLMs) of a potential threat that could bypass guardrails designed to prevent LLMs from delivering harmful and malicious requests. Dubbed "Bad Likert Judge," this technique asks an LLM to score the harmfulness of a given response using the Likert scale -- which measures a respondent's agreement or disagreement with a statement -- and then asks it to generate responses that align with the scales, including an example that could contain harmful content, Unit 42 said in research posted Tuesday (Dec. 31). "We have tested this technique across a broad range of categories against six state-of-the-art text-generation LLMs," the article said. "Our results reveal that this technique can increase the attack success rate (ASR) by more than 60% compared to plain attack prompts on average." The research aims to help defenders prepare for potential attacks using this technique, according to the article. It did not evaluate every model, and the article's authors anonymized the tested models it mentions in order to avoid creating false impressions about specific providers, per the article. "It is important to note that this jailbreak technique targets edge cases and does not necessarily reflect typical LLM use cases," the article said. "We believe most AI [artificial intelligence] models are safe and secure when operated responsibly and with caution." Hackers have begun offering "jailbreak-as-a-service" that uses prompts to trick commercial AI chatbots into generating content they typically prohibit, such as instructions for illegal activities or explicit material, cybersecurity firm Trend Micro said in May. Organizations looking to get ahead of this evolving threat should fortify their cyberdefenses now, in part by proactively strengthening security postures and monitoring criminal forums to help prepare for worst-case scenarios involving AI, the firm said at the time. Unit 42 Senior Consulting Director Daniel Sergile told lawmakers during an April hearing: "AI enables [cybercriminals] to move laterally with increased speed and identify an organization's critical assets for exfiltration and extortion. Bad actors can now execute numerous attacks simultaneously against one company, leveraging multiple vulnerabilities."
Share
Share
Copy Link
Cybersecurity researchers unveil a new AI jailbreak method called 'Bad Likert Judge' that significantly increases the success rate of bypassing large language model safety measures, raising concerns about potential misuse of AI systems.
Cybersecurity researchers from Palo Alto Networks' Unit 42 have discovered a new jailbreak technique called 'Bad Likert Judge' that could potentially bypass the safety guardrails of large language models (LLMs). This multi-turn attack strategy significantly increases the likelihood of generating harmful or malicious responses from AI systems 1.
The technique exploits the LLM's own understanding of harmful content by:
This method leverages the model's long context window and attention mechanisms, gradually nudging it towards producing malicious responses without triggering internal protections 1.
Unit 42's research revealed alarming results:
The discovery of 'Bad Likert Judge' highlights the ongoing challenges in AI security:
This new jailbreak method adds to the growing list of AI security concerns:
As AI systems become more prevalent, the need for robust security measures intensifies:
Reference
[1]
Researchers from Anthropic reveal a surprisingly simple method to bypass AI safety measures, raising concerns about the vulnerability of even the most advanced language models.
5 Sources
5 Sources
Security researchers have developed a new attack method called 'Imprompter' that can secretly instruct AI chatbots to gather and transmit users' personal information to attackers, raising concerns about the security of AI systems.
3 Sources
3 Sources
DeepSeek's AI model, despite its high performance and low cost, has failed every safety test conducted by researchers, making it vulnerable to jailbreak attempts and potentially harmful content generation.
12 Sources
12 Sources
OpenAI reports multiple instances of ChatGPT being used by cybercriminals to create malware, conduct phishing attacks, and attempt to influence elections. The company has disrupted over 20 such operations in 2024.
15 Sources
15 Sources
DeepSeek's latest AI model, R1, is reported to be more susceptible to jailbreaking than other AI models, raising alarms about its potential to generate harmful content and its implications for AI safety.
2 Sources
2 Sources
The Outpost is a comprehensive collection of curated artificial intelligence software tools that cater to the needs of small business owners, bloggers, artists, musicians, entrepreneurs, marketers, writers, and researchers.
© 2025 TheOutpost.AI All rights reserved