New AGI Benchmark Stumps Leading AI Models, Highlighting Gap in General Intelligence

Arc Prize Foundation Introduces Challenging New AGI Benchmark

The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, has unveiled a new benchmark test called ARC-AGI-2, designed to measure the general intelligence of leading AI models [1]. The test has proven significantly more challenging than its predecessor, with most current AI models scoring only in the low single digits.

Performance of Leading AI Models

The results of the ARC-AGI-2 test have been eye-opening:

OpenAI's o3-low model, which previously scored 75.7% on ARC-AGI-1, only managed 4% on the new test [2].
"Reasoning" AI models like OpenAI's o1-pro and DeepSeek's R1 scored between 1% and 1.3% [1].
Powerful non-reasoning models including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash scored around 1% [1][3].
Pure language models (LLMs) scored 0% on the benchmark [5].

In stark contrast, a human panel achieved an average score of 60% on the test, with some individuals solving all tasks perfectly [1].

Key Features of ARC-AGI-2

The new benchmark introduces several important changes:

Efficiency Metric: Unlike its predecessor, ARC-AGI-2 considers the cost and computational resources required to complete tasks [1][2].
Adaptability: The test focuses on AI models' ability to acquire new skills efficiently and apply them to unfamiliar problems [3].
Visual Pattern Recognition: Tasks involve identifying patterns in colored squares and generating correct "answer" grids [1].
Contextual Rule Application: Models must interpret symbols beyond visual patterns and apply different rules based on context [5].
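To make the grid-based task format more concrete, here is a minimal Python sketch of what such a puzzle might look like. Everything in it is an illustrative assumption rather than the foundation's actual data or scoring code: the toy task, the hidden rule (mirroring each row), and the helper names `mirror_rows`, `fits_examples`, and `score` are all hypothetical. It assumes a task gives a few worked input/output grid pairs plus a test input, and that an answer counts only if the whole grid matches exactly.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]  # each integer stands for one square's color

# Hypothetical toy task (not from the benchmark): the hidden rule is
# "mirror every row left to right".
task: Dict[str, List[Dict[str, Grid]]] = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [
        {"input": [[0, 4], [5, 0]], "output": [[4, 0], [0, 5]]},
    ],
}


def mirror_rows(grid: Grid) -> Grid:
    """Candidate rule: reverse the order of the squares in each row."""
    return [list(reversed(row)) for row in grid]


def fits_examples(solver: Callable[[Grid], Grid], task: dict) -> bool:
    """True if the candidate rule reproduces every worked example."""
    return all(solver(pair["input"]) == pair["output"] for pair in task["train"])


def score(solver: Callable[[Grid], Grid], task: dict) -> float:
    """Fraction of test grids reproduced exactly (assumed all-or-nothing per grid)."""
    tests = task["test"]
    solved = sum(solver(pair["input"]) == pair["output"] for pair in tests)
    return solved / len(tests)


if __name__ == "__main__":
    print("fits worked examples:", fits_examples(mirror_rows, task))
    print("test score:", score(mirror_rows, task))
```

The point of the sketch is that the solver is never told the rule. It has to infer it from a handful of examples, which is the kind of skill acquisition the benchmark is described as measuring.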

Implications for AGI Development

The poor performance of leading AI models on ARC-AGI-2 highlights the significant gap between current AI capabilities and human-level general intelligence. Greg Kamradt, co-founder of the Arc Prize Foundation, emphasized that intelligence is not solely about problem-solving ability but also about the efficiency of acquiring and deploying new skills [1].

This benchmark challenges the notion that brute-force computing power alone can lead to AGI. It suggests that fundamental advancements in AI architecture and learning approaches may be necessary to achieve human-like adaptability and efficiency [2].

Debate and Criticism

While many in the tech industry welcome new benchmarks to measure AI progress, some experts question the framing of these tests. Catherine Flick from the University of Staffordshire argues that performing well on such benchmarks should not be seen as a major step towards AGI, as they only assess an AI's ability to complete specific tasks rather than demonstrate true general intelligence [2].

Future of AGI Testing

The introduction of ARC-AGI-2 raises questions about the future of AGI evaluation. Joseph Imperial from the University of Bath suggests that future iterations might incorporate additional metrics, such as the minimum number of humans required to solve tasks, alongside performance and efficiency measures [2].

As the debate over AGI continues, the Arc Prize Foundation has announced a new contest challenging developers to reach 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task [1]. This competition aims to drive innovation in both AI performance and efficiency, potentially bringing us closer to the elusive goal of artificial general intelligence.
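The contest's two-part target, accuracy plus cost, can be expressed as a simple check. The Python sketch below is only an illustration under stated assumptions: the 85% and $0.42 figures come from the article, but the `TaskResult` record, the helper names, and the choice to average cost evenly across all attempted tasks are hypothetical, since the article does not describe how the foundation computes cost per task.

```python
from dataclasses import dataclass
from typing import List, Tuple

TARGET_ACCURACY = 0.85    # contest goal reported in the article
MAX_COST_PER_TASK = 0.42  # US dollars per task, as reported in the article


@dataclass
class TaskResult:
    """Hypothetical record of one attempted task."""
    solved: bool
    cost_usd: float


def summarize(results: List[TaskResult]) -> Tuple[float, float]:
    """Return (accuracy, mean cost per task); assumes cost is averaged over all tasks."""
    n = len(results)
    accuracy = sum(r.solved for r in results) / n
    cost_per_task = sum(r.cost_usd for r in results) / n
    return accuracy, cost_per_task


def meets_target(results: List[TaskResult]) -> bool:
    """Both conditions must hold: accurate enough and cheap enough."""
    accuracy, cost_per_task = summarize(results)
    return accuracy >= TARGET_ACCURACY and cost_per_task <= MAX_COST_PER_TASK


if __name__ == "__main__":
    # Made-up runs for illustration only, not real benchmark results.
    cheap_and_accurate = [TaskResult(i < 18, 0.40) for i in range(20)]
    accurate_but_costly = [TaskResult(i < 18, 2.00) for i in range(20)]
    print(meets_target(cheap_and_accurate))
    print(meets_target(accurate_but_costly))
```

Requiring both conditions at once is what separates this target from a pure accuracy leaderboard: under this reading, a system that solves most tasks by spending heavily on compute would still fall short.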

References

[1] A new, challenging AGI test stumps most AI models | TechCrunch
[2] Leading AI models fail new test of artificial general intelligence
[3] ChatGPT, Gemini and Claude all failed to solve a simple test that humans are acing
[4] A new AI test is outwitting OpenAI, Google models, among others
[5] LLMs Hit a New Low on ARC-AGI-2 Benchmark, Pure LLMs Score 0%
