New AI Benchmark 'Humanity's Last Exam' Stumps Top Models, Revealing Limits of Current AI

Scale AI and the Center for AI Safety have introduced a new AI benchmark, 'Humanity's Last Exam', that has proven difficult for even the most advanced AI models, highlighting the current limitations of artificial intelligence.

New Benchmark Challenges Top AI Models

Scale AI and the Center for AI Safety (CAIS) have introduced a groundbreaking new AI benchmark called "Humanity's Last Exam" (HLE), designed to test the limits of AI knowledge at the frontiers of human expertise [1][2]. This benchmark aims to address the issue of "benchmark saturation," where AI models have been rapidly excelling on standard tests, making it difficult to accurately gauge their capabilities [3].

Comprehensive and Challenging Test Design

The HLE consists of 3,000 questions covering over 100 subjects in mathematics, science, and the humanities [1]. These questions were carefully selected from an initial pool of 70,000, with input from nearly 1,000 subject-expert contributors across 500 institutions in 50 countries [2][4]. The benchmark includes multiple-choice and short-answer questions, as well as multi-modal items incorporating text, diagrams, and images [4].
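
To make the question format concrete, here is a minimal sketch of what a single HLE-style item might look like as a data record. The schema, field names, and example content are hypothetical illustrations, not details taken from the released dataset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HLEItem:
    """Hypothetical schema for one benchmark question (illustrative only)."""
    question_id: str
    subject: str                          # e.g. "mathematics" or "classics"
    question: str                         # full question text
    answer_type: str                      # "multiple_choice" or "short_answer"
    choices: Optional[list[str]] = None   # present only for multiple-choice items
    gold_answer: str = ""                 # the expert-provided reference answer
    image_path: Optional[str] = None      # set only for multi-modal items

# An invented multiple-choice example:
item = HLEItem(
    question_id="demo-001",
    subject="mathematics",
    question="Which of the following groups is simple?",
    answer_type="multiple_choice",
    choices=["Z/6Z", "A5", "D4", "Z/8Z"],
    gold_answer="A5",
)
```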

Performance of Top AI Models

In initial testing, current AI models struggled significantly with the HLE [3][5]:

  • OpenAI's GPT-4o: 3.3% accuracy
  • Anthropic's Claude 3.5 Sonnet: 4.3% accuracy
  • Google's Gemini 1.5 Pro: 6.2% accuracy
  • OpenAI's o1: 9.1% accuracy
  • DeepSeek-R1: 9.4% accuracy

These results stand in stark contrast to the high scores (often over 90%) that many of these models achieve on other popular benchmarks like MMLU, MATH, and GPQA [1][2].
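
Accuracy on a benchmark like this is simply the fraction of questions graded correct. The toy sketch below scores predictions against reference answers using naive exact-match grading; it is a stand-in illustration, since grading short expert-level answers in practice is more involved than simple string comparison.

```python
def exact_match(prediction: str, gold: str) -> bool:
    """Naive grading: case-insensitive exact match (a toy stand-in for real judging)."""
    return prediction.strip().lower() == gold.strip().lower()

def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Fraction of items graded correct."""
    assert len(predictions) == len(golds)
    correct = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)

# Invented answers: 1 correct out of 3 gives roughly 33% accuracy.
preds = ["A5", "Paris", "42"]
golds = ["A5", "Vienna", "41"]
print(f"accuracy: {accuracy(preds, golds):.1%}")  # accuracy: 33.3%
```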

Implications for AI Development

The poor performance of top AI models on the HLE reveals that there are still significant gaps in AI capabilities when it comes to expert-level knowledge and complex reasoning [2]. Dan Hendrycks, co-founder and executive director of CAIS, noted that while it's uncertain how quickly models will advance, the HLE currently demonstrates that there are still expert-level questions AI models cannot answer [1][2].

Future Outlook

While the current results show a clear limitation in AI capabilities, researchers are cautious about making long-term predictions. Given the rapid pace of AI advancement, it's considered plausible that models could reach over 50% accuracy on the HLE by the end of the year [2]. However, the benchmark's creators emphasize that such an achievement would not necessarily indicate autonomous research capabilities or artificial general intelligence [2].

Ongoing Research and Accessibility

CAIS and Scale AI plan to release the HLE dataset to researchers for further study of AI systems and their limitations [1]. The benchmark remains open for additional test questions, though cash prizes are no longer being awarded [1]. This initiative represents an important step in creating more challenging and comprehensive evaluations of AI capabilities as the field continues to evolve rapidly.
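
As a rough illustration of how a researcher might access the data once it is public, the sketch below loads it with the Hugging Face datasets library; the dataset identifier and field name are assumptions based on common release conventions, not details confirmed in this article.

```python
# Assumed access path (hypothetical dataset ID and field name):
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")  # "cais/hle" is an assumption
print(len(hle))                               # number of questions
print(hle[0]["question"])                     # assumes a "question" field exists
```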
