Humanity's Last Exam reveals the gap between AI and human intelligence despite rapid progress

Reviewed by Nidhi Govil


Researchers created Humanity's Last Exam, a PhD-level AI benchmark with 2,500 questions designed to assess the limits of AI reasoning. Google's Gemini 3 Deep Think achieved the highest score at 48.4%, while human experts score around 90%. Despite progress, experts emphasize this doesn't signal the arrival of Artificial General Intelligence.

A New Standard to Assess the Limits of AI Reasoning

Researchers at the Center for AI Safety and Scale AI have launched Humanity's Last Exam, a highly challenging PhD-level benchmark designed to measure AI capabilities against human expertise across more than 100 subjects [1]. The test contains 2,500 questions developed with input from more than 1,000 subject-matter experts from 500 institutions across 50 countries, representing one of the most comprehensive efforts to evaluate AI progress [1]. Published in the journal Nature in January 2025, this AI benchmark addresses a critical problem: existing tests like MMLU had become too easy for modern systems, making it difficult to accurately measure where leading AI models truly stand [2].

Source: Earth.com

Leading AI Models Score Low Despite Advanced Capabilities

When Humanity's Last Exam launched in January 2025, the results were sobering. OpenAI's GPT-4o scored just 2.7%, while Claude 3.5 Sonnet from Anthropic reached only 4.1% [2]. Even OpenAI's flagship o1 model achieved merely 8.3%, despite being among the most advanced systems available [1]. By February 2026, Google's Gemini 3 Deep Think set the current record at 48.4%, while human experts consistently score around 90% in their respective domains [1]. More recent systems, including Gemini 3.1 Pro and Claude Opus 4.6, have reached approximately 40% to 50% accuracy, showing progress but still revealing a substantial gap between AI and human intelligence [2].

Source: Live Science

Building Non-Searchable Questions to Prevent Training Dataset Contamination

The exam's creators implemented strict criteria requiring questions to be precise, unambiguous, solvable, and, critically, non-searchable [1]. Researchers didn't want models to cheat through simple web searches or to encounter questions already present in their training data. During development, each question was tested against AI models, and any that the systems could answer correctly was automatically rejected [1]. From more than 70,000 submissions, approximately 13,000 stumped the large language models; these were then vetted by subject-matter experts before being narrowed to the final 2,500 questions [1]. Dr. Tung Nguyen of Texas A&M University contributed 73 questions, the second-highest count of any author, focusing on mathematics and computer science [2].
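The adversarial filtering step described above can be pictured as a simple rejection loop: a candidate question survives only if every frontier model fails it. The sketch below is illustrative only; the function names, data layout, and toy "model" are assumptions, not the actual HLE pipeline.

```python
# Illustrative sketch of adversarial question filtering (assumed names/structure,
# not the real HLE codebase): keep only questions that stump every model.

def model_fails(model, question):
    """Hypothetical check: True if the model answers the question incorrectly."""
    return model(question["prompt"]) != question["answer"]

def filter_submissions(submissions, models):
    """Reject any question that at least one model can answer correctly."""
    return [q for q in submissions if all(model_fails(m, q) for m in models)]

# Toy usage: a "model" that always answers "A".
naive_model = lambda prompt: "A"
submissions = [
    {"prompt": "Q1", "answer": "A"},  # naive model gets this right -> rejected
    {"prompt": "Q2", "answer": "B"},  # naive model fails -> kept for expert review
]
survivors = filter_submissions(submissions, [naive_model])
print([q["prompt"] for q in survivors])  # ['Q2']
```

In the real process, the roughly 13,000 questions surviving this automated filter still went through human expert vetting before the final 2,500 were selected.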

Why High Scores Don't Signal Artificial General Intelligence

Despite Gemini 3 Deep Think's 48.4% score representing significant progress, the study's authors categorically state that high accuracy on Humanity's Last Exam does not indicate the arrival of Artificial General Intelligence [1]. "High accuracy on HLE would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence," researchers stated. Manuel Schottdorf, a neuroscientist at the University of Delaware who contributed questions, emphasized that "doing well on HLE is a necessary, but not a sufficient criterion to say that machines have reached true intelligence" [1]. Evaluating AI progress requires understanding that intelligence involves depth, context, and specialized expertise beyond pattern recognition [2].

What This Means for Developers and Policymakers

"Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do," Dr. Nguyen explained [2]. The benchmark provides a foundation for measuring progress and identifying risks, helping stakeholders avoid getting swept up in hype. When systems perform well on familiar tests like MMLU, created with participation from Center for AI Safety founder Dan Hendrycks, it becomes tempting to assume they approach human-level understanding [1][2]. However, Gemini 3 Deep Think achieved 84.6% on the ARC-AGI-2 benchmark just a week after failing to reach 50% on Humanity's Last Exam, demonstrating how different tests measure different capabilities [1]. The team made only part of the exam public while keeping most questions hidden, preventing future systems from simply training on the full answer set and preserving the test's integrity for ongoing AI reasoning assessment [2].
