2 Sources
[1]
Acing this new AI exam -- which its creators say is the toughest in the world -- might point to the first signs of AGI
Humanity's Last Exam is a PhD-level benchmark designed to test the limits of AI reasoning. Although Google's Gemini 3 scored a staggering 48.4%, experts stress that this does not indicate the arrival of artificial general intelligence (AGI).

Researchers at the Center for AI Safety and Scale AI have published "Humanity's Last Exam" -- a test designed to measure how close today's most powerful artificial intelligence (AI) models are to meeting or exceeding human-level knowledge across several domains. The test was launched in January 2025, but scientists outlined the framework and their thinking behind its design for the first time in a new study published Jan. 28 in the journal Nature.

It contains a corpus of 2,500 questions across more than 100 subjects, with input from more than 1,000 subject-matter experts from 500 institutions across 50 countries. The exam consists of multiple-choice and short-answer questions, each of which has a known solution that is "unambiguous and easily verifiable but cannot be quickly answered by internet retrieval."

At launch, the researchers tested OpenAI's GPT-4o and o1 models, Google's Gemini 1.5 Pro, Anthropic's Claude 3.5 Sonnet and DeepSeek R1. OpenAI's o1 system notched the top spot with a score of just 8.3%. Despite this poor performance, the researchers wrote at the time that "given the rapid pace of AI development, it is plausible that models could exceed 50% accuracy on HLE by the end of 2025." As of Feb. 12, 2026, the highest score achieved so far is 48.4%, set by Google's Gemini 3 Deep Think. Human experts, meanwhile, score around 90% in their respective domains.

Testing the smartest machines in the world

Humanity's Last Exam was intentionally designed to be extremely difficult for AI models. During early development, the researchers put out a global call for submissions from subject-matter experts across numerous domains.
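Because every question is required to have a known solution that is "unambiguous and easily verifiable," scoring a model on such a benchmark can be reduced to normalized exact matching against an answer key. Below is a minimal sketch of that kind of grader; the function names and the toy data are illustrative only, not from the HLE codebase:

```python
# Sketch of exact-match grading for a benchmark whose answers are short,
# unambiguous and verifiable. All names here are illustrative.

def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial formatting
    differences don't affect the score."""
    return " ".join(answer.strip().lower().split())

def grade(predictions: dict[str, str], answer_key: dict[str, str]) -> float:
    """Return accuracy: the fraction of questions answered exactly right."""
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(gold)
        for qid, gold in answer_key.items()
    )
    return correct / len(answer_key)

# Toy usage with made-up questions and answers:
key = {"q1": "example answer", "q2": "42"}
preds = {"q1": "  Example   Answer ", "q2": "41"}
print(grade(preds, key))  # 0.5
```

This kind of deterministic grading is what lets a 2,500-question exam report a single headline accuracy figure, such as the 48.4% cited above, without human judges in the loop.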
The researchers enforced strict submission criteria requiring questions to be precise, unambiguous, solvable and non-searchable. They didn't want models to cheat by performing a simple web search, or for any of the questions to already appear online -- which would increase the likelihood that a given model had the answer in its training dataset. Each submitted question was then fed to the AI models, and the team automatically rejected any the models could answer correctly. Of the more than 70,000 questions submitted, approximately 13,000 stumped the LLMs. These were then vetted by a team of subject-matter experts, approved by the research team, and presented to the scientific community for open feedback. Ultimately, the researchers narrowed the pool down to 2,500 questions that generally fall within the realm of PhD-level testing.

An example of a trivia question in the exam is: "In Greek mythology, who was Jason's maternal great-grandfather?" Meanwhile, an example of a physics question asks for the relationship between different forces during motion in a scenario where a block is placed on a horizontal rail (on which it can slide frictionlessly) while also being attached to a rigid, massless rod of unknown length.

The breadth of questions and scope of subjects covered by Humanity's Last Exam set it apart from similar benchmarking tools, its creators say. Common tests, such as the Massive Multitask Language Understanding (MMLU) dataset, which was authored with participation from Center for AI Safety founder Dan Hendrycks, test only a small subset of expert-level domain knowledge, primarily focusing on coding and mathematics. Even state-of-the-art benchmarks such as Francois Chollet's ARC-AGI suite struggle to escape the memorization and searchability problems that the creators of Humanity's Last Exam say the new test addresses.
Gemini's Deep Think, for example, achieved 84.6% on the ARC-AGI-2 benchmark just a week after failing to reach 50% on the HLE test.

The ultimate prize is general intelligence

Humanity's Last Exam likely represents the AI world's best attempt to date at measuring the broad-spectrum capabilities of modern AI models relative to human experts, but the study's authors categorically state that a high score on the HLE is in no way indicative of the arrival of artificial general intelligence (AGI). "High accuracy on HLE would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence," the scientists said in the study.

"Doing well on HLE is a necessary, but not a sufficient, criterion to say that machines have reached true intelligence," Manuel Schottdorf, a neuroscientist at the University of Delaware's Department of Psychological and Brain Sciences, said in a recent statement. Schottdorf is one of the many experts whose questions were accepted into the HLE corpus. "They will have to be good enough to solve these questions, but that as a fact alone can't allow us to conclude that machines are truly intelligent."
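The adversarial filtering step described above -- feed each candidate question to frontier models and automatically reject any the models answer correctly -- can be sketched as a simple pipeline. In this sketch the "models" are just callables returning canned answers; nothing here reflects HLE's actual infrastructure:

```python
# Sketch of adversarial question filtering: a candidate question survives
# only if every reference model fails it. Models are stand-in callables.
from typing import Callable

def survives(question: str, gold: str,
             models: list[Callable[[str], str]]) -> bool:
    """A candidate survives only if no model produces the gold answer."""
    return all(m(question).strip().lower() != gold.strip().lower()
               for m in models)

def filter_submissions(submissions, models):
    """Keep only (question, gold) pairs that stump every model. Survivors
    would then go on to expert review, as in the HLE pipeline."""
    return [(q, a) for q, a in submissions if survives(q, a, models)]

# Toy usage: two fake "models" implemented as dicts of canned answers.
model_a = lambda q: {"easy": "4"}.get(q, "I don't know")
model_b = lambda q: {"easy": "4", "medium": "7"}.get(q, "I don't know")
pool = [("easy", "4"), ("medium", "7"), ("hard", "novel result")]
print(filter_submissions(pool, [model_a, model_b]))
# [('hard', 'novel result')]
```

The design choice matters: because questions any current model can answer are discarded up front, the surviving set measures exactly the frontier of model failure -- which is also why scores on the exam start near zero and climb as models improve.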
[2]
Humanity's Last Exam pushed AI to its limits - but did it pass?
A few years ago, the big question in AI was whether a system could pass the kinds of exams humans struggle with. Now, the problem has flipped. Some artificial intelligence systems started cruising through well-known academic benchmarks, and that success created an awkward reality: the tests weren't telling us much anymore.

Take the Massive Multitask Language Understanding exam, better known as MMLU. It used to look tough. Then newer models began racking up strong scores, and researchers realized that "passing" didn't necessarily mean the systems truly understood what they were doing.

A group of researchers decided to stop tweaking old tests and build a new one from scratch. Nearly 1,000 experts from around the world joined forces to create a massive assessment designed to sit just past what today's AI can handle. They called it "Humanity's Last Exam" (HLE). It's a 2,500-question challenge that ranges across mathematics, the humanities, natural sciences, ancient languages, and narrow specialties most people never bump into outside a lab or a library. The work was described in a paper, with project documentation available at lastexam.ai.

One contributor who helped shape the test was Dr. Tung Nguyen, an instructional associate professor in the Department of Computer Science and Engineering at Texas A&M University. Dr. Nguyen helped author and refine questions, including a big chunk of what's publicly available.

The questions themselves aren't the usual multiple-choice trivia. They draw on the kind of knowledge that takes years to build, and they're written to be cleanly graded. Experts wrote and reviewed each one to make sure it had a single, unambiguous, verifiable answer, and that it couldn't be solved instantly by grabbing something off the internet.
Some prompts reach into topics like translating ancient Palmyrene inscriptions, spotting microanatomical structures in birds, or parsing the fine points of Biblical Hebrew pronunciation.

Here's the part that makes HLE feel different from older benchmarks: the creators tested questions against leading AI models as they built the exam. If a system could answer a question correctly, that question didn't make the cut. The goal wasn't to be cute or cruel. It was to pinpoint where current AI still falls short, in a way that can be measured.

"When AI systems start performing extremely well on human benchmarks, it's tempting to think they're approaching human-level understanding," said Dr. Nguyen. "But HLE reminds us that intelligence isn't just about pattern recognition -- it's about depth, context and specialized expertise."

Once the exam took shape, the results landed with a thud. Early testing showed that even high-profile systems struggled to get traction. GPT-4o scored 2.7%; Claude 3.5 Sonnet reached 4.1%; OpenAI's flagship o1 model achieved only 8%. More recent top-end systems have done better, but not in a way that makes the gap disappear: Gemini 3.1 Pro and Claude Opus 4.6 have reached around 40% to 50% accuracy. Those numbers are the point. HLE isn't trying to flatter AI. It's trying to keep the yardstick honest.

Dr. Nguyen's involvement also shows how much hands-on work goes into a benchmark like this. He contributed 73 of the 2,500 public questions -- the second-highest count of any author -- and he wrote the most questions in math and computer science. "Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do," he said. "Benchmarks provide the foundation for measuring progress and identifying risks."

The concern is simple: when a test gets too familiar, it stops measuring what people think it measures.
A system can look impressive on human-designed exams without matching human understanding in the messy, context-heavy way real life demands.

The title sounds dramatic, but the purpose is practical. The exam is meant to map strengths and weak spots so developers and decision-makers don't get swept away by hype. "This isn't a race against AI," said Dr. Nguyen. "It's a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters."

A big challenge with any AI test is that models learn fast, and they also memorize fast. The team behind Humanity's Last Exam tried to plan for that by making only part of the exam public while keeping most questions hidden, so future systems can't simply train on the full answer set. "For now, Humanity's Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence, and despite rapid technological advances, it remains wide," said Dr. Nguyen.

One of the most interesting parts of HLE isn't the name or the scoreboard. It's the way it was made: experts across fields working together to write questions that reflect real specialist knowledge, not just clever wordplay or test-taking tricks. "What made this project extraordinary was the scale," said Dr. Nguyen. "Experts from nearly every discipline contributed. It wasn't just computer scientists; it was historians, physicists, linguists, medical researchers. That diversity is exactly what exposes the gaps in today's AI systems -- perhaps ironically, it's humans working together."
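The public/private split described above -- release part of the question set for open evaluation while holding the rest back so future models can't train on the full answer key -- is a common defense against benchmark contamination. A minimal sketch of how such a split might be maintained (a reproducible seeded shuffle; nothing here reflects HLE's actual tooling):

```python
# Sketch of a contamination defense: split a question pool into a public
# set (released) and a private held-out set (kept secret for scoring).
import random

def split_exam(questions: list[dict], public_fraction: float,
               seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Deterministically split a question pool. The fixed seed makes the
    split reproducible, so the same questions stay private over time."""
    rng = random.Random(seed)
    shuffled = questions[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * public_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy usage with placeholder questions:
pool = [{"id": i} for i in range(10)]
public, private = split_exam(pool, public_fraction=0.3)
print(len(public), len(private))  # 3 7
```

Scoring models on the private half then gives a result that cannot have been inflated by memorizing leaked answers, which is what keeps the yardstick honest as training corpora grow.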
Researchers created Humanity's Last Exam, a PhD-level AI benchmark with 2,500 questions designed to assess the limits of AI reasoning. Google's Gemini 3 Deep Think achieved the highest score at 48.4%, while human experts score around 90%. Despite progress, experts emphasize this doesn't signal the arrival of Artificial General Intelligence.
Researchers at the Center for AI Safety and Scale AI have launched Humanity's Last Exam, a highly challenging PhD-level benchmark designed to measure AI capabilities against human expertise across more than 100 subjects [1]. The test contains 2,500 questions developed with input from more than 1,000 subject-matter experts from 500 institutions across 50 countries, representing one of the most comprehensive efforts to evaluate AI progress [1]. Launched in January 2025 and later described in a study in the journal Nature, this AI benchmark addresses a critical problem: existing tests like MMLU had become too easy for modern systems, making it difficult to accurately measure where leading AI models truly stand [2].
Source: Earth.com
When Humanity's Last Exam launched in January 2025, the results were sobering. OpenAI's GPT-4o scored just 2.7%, while Claude 3.5 Sonnet from Anthropic reached only 4.1% [2]. Even OpenAI's flagship o1 model achieved merely 8.3%, despite being among the most advanced systems available [1]. By February 2026, Google's Gemini 3 Deep Think had set the current record at 48.4%, while human experts consistently score around 90% in their respective domains [1]. More recent systems, including Gemini 3.1 Pro and Claude Opus 4.6, have reached approximately 40% to 50% accuracy, showing progress but still revealing a substantial gap between AI and human intelligence [2].
Source: Live Science
The exam's creators implemented strict criteria requiring questions to be precise, unambiguous, solvable and, critically, non-searchable [1]. Researchers didn't want models to cheat through simple web searches or to encounter questions already present in their training data. During development, each question was tested against AI models, and any that the systems could answer correctly were automatically rejected [1]. From more than 70,000 submissions, approximately 13,000 stumped the large language models; these were then vetted by subject-matter experts before being narrowed to the final 2,500 questions [1]. Dr. Tung Nguyen from Texas A&M University contributed 73 questions, the second-highest author count, focusing on mathematics and computer science [2].
Despite Gemini 3 Deep Think's 48.4% score representing significant progress, the study's authors categorically state that high accuracy on Humanity's Last Exam does not indicate the arrival of artificial general intelligence [1]. "High accuracy on HLE would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence," the researchers stated. Manuel Schottdorf, a neuroscientist at the University of Delaware who contributed questions, emphasized that "doing well on HLE is a necessary, but not a sufficient criterion to say that machines have reached true intelligence" [1]. The evaluation of AI progress requires understanding that intelligence involves depth, context and specialized expertise beyond pattern recognition [2].

"Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do," Dr. Nguyen explained [2]. The benchmark provides a foundation for measuring progress and identifying risks, helping stakeholders avoid getting swept away by hype. When systems perform well on familiar tests like MMLU, created with participation from Center for AI Safety founder Dan Hendrycks, it becomes tempting to assume they approach human-level understanding [1][2]. However, Gemini's Deep Think achieved 84.6% on the ARC-AGI-2 benchmark just a week after failing to reach 50% on Humanity's Last Exam, demonstrating how different tests measure different capabilities [1]. The team made only part of the exam public while keeping most questions hidden, preventing future systems from simply training on the full answer set and maintaining the test's integrity for ongoing AI reasoning assessment [2].