Anthropic Unveils 'Constitutional Classifiers' to Combat AI Jailbreaking, Offers $20,000 Reward


Anthropic introduces a new AI safety system called Constitutional Classifiers, designed to prevent jailbreaking attempts. The company is offering up to $20,000 to anyone who can successfully bypass this security measure.


Anthropic's New AI Safety System: Constitutional Classifiers

Anthropic, a leading AI company, has unveiled a novel approach to AI safety called Constitutional Classifiers. The system is designed to prevent "jailbreaking" attempts on large language models (LLMs) such as its Claude AI [1].

How Constitutional Classifiers Work

The Constitutional Classifiers system builds on Anthropic's Constitutional AI approach, which aims to make AI models "harmless" by having them adhere to a set of principles, or "constitution" [2]. Key features include (a conceptual sketch of the filtering pipeline follows the list):

  1. Trains classifiers on synthetic data to filter jailbreak attempts
  2. Minimizes over-refusals of harmless content
  3. Defines allowed and disallowed content classes
  4. Accounts for jailbreak attempts in various languages and styles
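The sketch below is a minimal, hypothetical illustration of the general idea described above: a pair of constitution-guided classifiers screening both the user's prompt and the model's draft response. It is not Anthropic's implementation; the names (`Constitution`, `classify`, `guarded_generate`) and the keyword-matching stub are assumptions made purely for illustration.

```python
# Conceptual sketch only: constitution-guided input/output filtering
# around an LLM. In Anthropic's system the classifiers are models
# trained on synthetic data derived from the constitution; here the
# classifier is a trivial keyword stub.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Constitution:
    allowed: list[str]      # descriptions of permitted content classes
    disallowed: list[str]   # descriptions of prohibited content classes


def classify(text: str, constitution: Constitution) -> bool:
    """Return True if `text` is judged to fall in a disallowed class.

    Hypothetical stand-in for a trained classifier.
    """
    return any(topic.lower() in text.lower() for topic in constitution.disallowed)


def guarded_generate(prompt: str,
                     llm: Callable[[str], str],
                     constitution: Constitution) -> str:
    # Input classifier: screen the prompt before it reaches the model.
    if classify(prompt, constitution):
        return "Request declined by input classifier."
    draft = llm(prompt)
    # Output classifier: screen the draft response before returning it.
    if classify(draft, constitution):
        return "Response withheld by output classifier."
    return draft


if __name__ == "__main__":
    constitution = Constitution(
        allowed=["general chemistry homework"],
        disallowed=["chemical weapons synthesis"],
    )
    fake_llm = lambda p: f"Echo: {p}"  # stand-in for a real model
    print(guarded_generate("Explain acid-base titration.", fake_llm, constitution))
```

Wrapping the model on both sides is the key design choice implied by the announcement: even if a jailbreak slips past the input check, a harmful draft can still be caught before it is returned.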

Impressive Test Results

In initial testing, Anthropic reported significant success:

  • 183 human red-teamers spent over 3,000 hours attempting to jailbreak the system
  • None succeeded in getting the model to answer all 10 forbidden queries with a single, universal jailbreak
  • In a test of 10,000 synthetic jailbreaking attempts:
    • Claude alone blocked only 14% of attacks
    • Claude with Constitutional Classifiers blocked over 95% [3]

The Challenge: $20,000 Reward

Anthropic is now inviting the public to test their system:

  • $10,000 reward for passing all eight levels of the challenge
  • $20,000 for the first person to achieve a universal jailbreak
  • The challenge runs until February 10, 2025 [1]

Limitations and Ongoing Work

While the results are promising, Anthropic acknowledges some limitations:

  • The system may not prevent every universal jailbreak
  • New jailbreaking techniques could potentially emerge
  • The current compute cost is high, but efforts are underway to reduce it [4]

Industry Implications

This development is significant for several reasons:

  1. It addresses a major concern in AI safety: preventing misuse of powerful language models
  2. The high block rate (over 95%) sets a new benchmark for AI security measures
  3. Anthropic's open challenge promotes transparency and collaboration in AI safety research

Criticism and Concerns

Some critics argue that Anthropic is essentially crowdsourcing its security work without adequate compensation. Others worry about the potential dual-use nature of such research, as it could inadvertently provide insights for creating more sophisticated jailbreaking techniques [5].

As AI technology continues to advance, the development of robust safety measures like Constitutional Classifiers will likely play a crucial role in ensuring responsible AI deployment and mitigating potential risks associated with large language models.
