New AI Benchmark 'Humanity's Last Exam' Stumps Top Models, Revealing Limits of Current AI

Curated by THEOUTPOST

On Fri, 24 Jan, 12:02 AM UTC

7 Sources


Scale AI and the Center for AI Safety have introduced a challenging new AI benchmark called 'Humanity's Last Exam', which has proven difficult for even the most advanced AI models, highlighting the current limitations of artificial intelligence.

New Benchmark Challenges Top AI Models

Scale AI and the Center for AI Safety (CAIS) have introduced a groundbreaking new AI benchmark called "Humanity's Last Exam" (HLE), designed to test the limits of AI knowledge at the frontiers of human expertise [1][2]. This benchmark aims to address the issue of "benchmark saturation," where AI models have been rapidly excelling on standard tests, making it difficult to accurately gauge their capabilities [3].

Comprehensive and Challenging Test Design

The HLE consists of 3,000 questions covering over 100 subjects across mathematics, science, and the humanities [1]. The questions were carefully selected from an initial pool of 70,000, with input from nearly 1,000 subject-expert contributors across 500 institutions in 50 countries [2][4]. The benchmark includes multiple-choice and short-answer questions, as well as multi-modal items combining text, diagrams, and images [4].

Performance of Top AI Models

In initial testing, current AI models struggled significantly with the HLE:

  • OpenAI's GPT-4o: 3.3% accuracy
  • Anthropic's Claude 3.5 Sonnet: 4.3% accuracy
  • Google's Gemini 1.5 Pro: 6.2% accuracy
  • OpenAI's o1: 9.1% accuracy
  • DeepSeek-R1: 9.4% accuracy [3][5]

These results stand in stark contrast to the high scores (often over 90%) that many of these models achieve on other popular benchmarks like MMLU, MATH, and GPQA [1][2].
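To make the accuracy figures above concrete, here is a minimal sketch of how such a score is computed: the fraction of benchmark questions a model answers correctly. It assumes simple normalized exact-match grading; HLE's actual grading pipeline is not described in this article and may differ.

```python
def normalize(answer: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences are not counted as errors.
    return " ".join(answer.lower().split())

def accuracy(predictions: list[str], references: list[str]) -> float:
    # Fraction of predictions that exactly match the reference answers.
    assert len(predictions) == len(references), "one prediction per question"
    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return correct / len(references)

# Toy example: 2 of 3 answers correct -> 66.7% accuracy.
preds = ["Paris", "4", "mitochondria"]
refs = ["paris", "5", "Mitochondria"]
print(f"{accuracy(preds, refs):.1%}")  # 66.7%
```

On a 3,000-question benchmark like the HLE, GPT-4o's 3.3% score corresponds to roughly 99 correct answers.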

Implications for AI Development

The poor performance of top AI models on the HLE shows that significant gaps remain in AI capabilities when it comes to expert-level knowledge and complex reasoning [2]. Dan Hendrycks, co-founder and executive director of CAIS, noted that while it is uncertain how quickly models will advance, the HLE demonstrates that there are still expert-level questions AI models cannot answer [1][2].

Future Outlook

While the current results show clear limitations in AI capabilities, researchers are cautious about making long-term predictions. Given the rapid pace of AI advancement, they consider it plausible that models could exceed 50% accuracy on the HLE by the end of the year [2]. However, the benchmark's creators emphasize that such a score would not by itself indicate autonomous research capability or artificial general intelligence [2].

Ongoing Research and Accessibility

CAIS and Scale AI plan to release the HLE dataset to researchers for further study of AI systems and their limitations [1]. The benchmark remains open for additional test questions, though cash prizes are no longer being awarded [1]. This initiative represents an important step in creating more challenging and comprehensive evaluations of AI capabilities as the field continues to evolve rapidly.
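For researchers who want to examine the released questions, here is a minimal sketch of loading the dataset with the Hugging Face datasets library. The repository ID "cais/hle", the split name, and the field names are assumptions for illustration, not details confirmed by this article.

```python
from collections import Counter
from datasets import load_dataset

# Hedged sketch: load the released HLE dataset for analysis.
# The repo ID "cais/hle", the "test" split, and the field names
# below are assumptions, not confirmed by the article.
dataset = load_dataset("cais/hle", split="test")

# Inspect one record; actual field names may differ.
example = dataset[0]
print(example.get("question"))
print(example.get("answer"))

# Rough breakdown by question format, assuming an "answer_type"
# field distinguishes multiple-choice from short-answer items.
print(Counter(ex.get("answer_type") for ex in dataset))
```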

Continue Reading

OpenAI's Deep Research Dominates Humanity's Last Exam, Setting New Benchmarks in AI Capabilities

OpenAI's Deep Research achieves a record-breaking 26.6% accuracy on Humanity's Last Exam, a new benchmark designed to test the limits of AI reasoning and problem-solving abilities across diverse fields.

2 Sources


AI Experts Prepare "Humanity's Last Exam" to Challenge Advanced AI Systems

A group of AI researchers is developing a comprehensive test called "Humanity's Last Exam" to assess the capabilities and limitations of advanced AI systems. This initiative aims to identify potential risks and ensure responsible AI development.

9 Sources


Humanity's Last Exam: A Global Effort to Benchmark AI Intelligence

Researchers are developing a comprehensive test to measure AI capabilities, dubbed "Humanity's Last Exam." This collaborative effort aims to create benchmarks for assessing when AI reaches or surpasses human-level intelligence.

2 Sources


New AGI Benchmark Stumps Leading AI Models, Highlighting Gap in General Intelligence

The Arc Prize Foundation introduces ARC-AGI-2, a challenging new test for artificial general intelligence that current AI models, including those from OpenAI and Google, are struggling to solve. The benchmark emphasizes efficiency and adaptability, revealing limitations in current AI capabilities.

5 Sources


FrontierMath: New AI Benchmark Exposes Limitations in Advanced Mathematical Reasoning

Epoch AI's FrontierMath, a new mathematics benchmark, reveals that leading AI models struggle with complex mathematical problems, solving less than 2% of the challenges.

8 Sources
