Curated by THEOUTPOST
On Tue, 4 Feb, 4:03 PM UTC
2 Sources
[1]
OpenAI's Deep Research smashes records for the world's hardest AI exam, with ChatGPT o3-mini and DeepSeek left in its wake
The world's hardest AI exam, Humanity's Last Exam, was launched less than two weeks ago, and we've already seen a huge jump in accuracy, with ChatGPT o3-mini and now OpenAI's Deep Research topping the leaderboard. The AI benchmark, created by experts from around the world, contains some of the hardest reasoning problems and questions known to man - it's so hard that when I previously wrote about Humanity's Last Exam in the article linked above, I couldn't even understand one of the questions, let alone answer it.

At the time of writing that last article, world phenomenon DeepSeek R1 sat at the top of the leaderboard with a 9.4% accuracy score when evaluated on text only (not multi-modal). Now, OpenAI's o3-mini, which launched earlier this week, has scored 10.5% accuracy at the standard o3-mini setting and 13% at the o3-mini-high setting, which is more capable but takes longer to generate answers. More impressive, however, is OpenAI's new AI agent Deep Research, which scored 26.6% on the benchmark - a whopping 183% increase in accuracy over the previous leader in less than 10 days.

It's worth noting that Deep Research has web search capabilities that the other AI models lack, which makes direct comparisons slightly unfair. The ability to search the web is helpful for a test like Humanity's Last Exam, which includes some general knowledge questions. That said, models' accuracy on Humanity's Last Exam is steadily improving, and it does make you wonder how long we'll need to wait before an AI model comes close to mastering the benchmark. Realistically, AI shouldn't be able to come close any time soon, but I wouldn't bet against it.

OpenAI's Deep Research is an incredibly impressive tool, and I've been blown away by the examples OpenAI showed off when it announced the AI agent. Deep Research can work as your personal analyst, taking time to conduct intensive research and produce reports and answers that would otherwise take humans hours and hours to complete.

While a score of 26.6% on Humanity's Last Exam is seriously impressive, especially considering how far the benchmark's leaderboard has come in just a couple of weeks, it's still a low score in absolute terms - no one would claim to have passed a test with anything less than 50% in the real world. Humanity's Last Exam is an excellent benchmark, and one that will prove invaluable as AI models develop, enabling us to gauge just how far they've come. How long will we have to wait to see an AI model pass the 50% mark? And which model will be the first to do so?
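If you want to sanity-check that 183% figure, it's simple arithmetic. Here's a minimal sketch in Python, assuming the baseline is DeepSeek R1's earlier 9.4% text-only score and the new result is Deep Research's 26.6%, both taken from the article above:

```python
# Back-of-the-envelope check of the reported "183% increase" claim.
# Assumption: the baseline is DeepSeek R1's 9.4% text-only score and the
# new result is Deep Research's 26.6%, as quoted in the article.
previous_best = 9.4   # DeepSeek R1 accuracy (%), text-only evaluation
deep_research = 26.6  # OpenAI Deep Research accuracy (%)

relative_increase = (deep_research - previous_best) / previous_best * 100
print(f"Relative increase: {relative_increase:.0f}%")  # prints ~183%
```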
[2]
Humanity's Last Exam Explained - The ultimate AI benchmark that sets the tone of our AI future
Artificial intelligence (AI) has been evolving at breakneck speed, with models like OpenAI's GPT-4 and DeepSeek's R1 pushing the boundaries of what machines can do. We are in an era where AI systems can write poetry, diagnose diseases, and even drive cars. Now a new benchmark has emerged with the promise of redefining humanity's relationship with technology. Dubbed "Humanity's Last Exam" (HLE), this ambitious evaluation is being hailed as the definitive test to determine whether AI can match - or surpass - human-level reasoning, creativity, and ethical judgment. But what exactly is Humanity's Last Exam? Why are experts calling it humanity's "final test"? And should we be excited or concerned about its implications? Let's break it down.

The problem with existing AI benchmarks is pretty simple: models are acing them too easily. Take the Massive Multitask Language Understanding (MMLU) benchmark, for example. It was once the gold standard for evaluating AI general knowledge, but today's top AI models are hitting 90%+ accuracy on it. That sounds impressive - until you realize that many of these tests weren't designed to handle the reasoning, creativity, or multi-modal capabilities (text + image processing) that cutting-edge AI systems are starting to develop. HLE was created specifically to push AI models to their limits. Developed by the Center for AI Safety (CAIS) and Scale AI, it introduces a tougher, more comprehensive challenge that better reflects real-world intelligence.

Humanity's Last Exam is built to simulate real expert-level problem-solving rather than just regurgitating memorized facts. One of its defining features is its massive scale. The exam consists of 3,000 highly challenging questions that span more than 100 subjects, ranging from mathematics and physics to law, medicine, and philosophy. Unlike many previous AI benchmarks, which were primarily designed by researchers, Humanity's Last Exam's questions were crowdsourced from a global network of nearly 1,000 experts across 500+ institutions in 50 countries. This diversity ensures that the test reflects a broad spectrum of knowledge domains and problem-solving approaches.

Another major distinction is its multi-modal challenge. While most AI benchmarks focus purely on text-based reasoning, HLE incorporates a mix of text and image-based questions, with 10% of the exam requiring AI systems to process visual information alongside written context. This added layer of complexity makes it much harder for AI models to succeed using simple pattern recognition alone. Instead, they must demonstrate the ability to integrate different types of information - something that remains a major challenge for even the most advanced AI systems today.

To further prevent AI from "gaming" the test, some of the toughest questions in HLE are kept hidden from public datasets. This is a critical improvement over older benchmarks, where AI companies could simply train their models on the test questions to artificially boost their scores. By introducing a level of secrecy, HLE ensures that models must exhibit genuine problem-solving ability rather than just memorization.

So how are AI models performing on HLE so far? Not great. Even the best AI models today are struggling, with most scoring in the single digits or low double digits.
Here's how some notable models have fared: DeepSeek R1 scored 9.4% accuracy on the text-only evaluation, OpenAI's o3-mini scored 10.5%, o3-mini-high reached 13%, and OpenAI's new Deep Research agent currently leads with 26.6%. Compare this to older benchmarks like MMLU, where top AI models regularly exceed 90% accuracy, and you can see just how much harder HLE is. This tells us that while AI models may look impressive on older tests, they're still far from mastering complex reasoning and real-world problem-solving. The fact that no model has come close to human-level performance on HLE suggests that we still have a long way to go before AI reaches true expert-level proficiency.

HLE isn't just a tougher exam - it's a reality check for AI development. As AI systems keep improving, benchmarks like this will be essential in separating hype from actual progress. One of the biggest takeaways from HLE is that it exposes AI weaknesses that still need to be addressed. Today's models struggle with deep reasoning, multi-modal understanding, and tackling entirely new types of problems. Rather than just showing what AI can do well, HLE provides clear evidence of where it still falls short. This kind of insight is invaluable for researchers and developers looking to build more capable AI systems.

Beyond identifying weaknesses, HLE also helps set a new standard for AI development. AI companies will no longer be able to claim groundbreaking progress based solely on outdated benchmarks. Instead, they'll have to prove that their models can handle the kinds of real-world challenges that actually matter. This could lead to more meaningful advancements in AI, with models that are better equipped to assist in high-stakes fields like science, medicine, and law.

Perhaps most importantly, HLE introduces a new level of accountability into AI development. There has been growing concern that AI companies are prioritizing flashy but superficial improvements over real progress. By creating a much tougher, more transparent benchmark, HLE forces companies to build AI models that actually perform well under pressure, rather than just looking good in controlled settings.

AI is advancing faster than ever, but progress isn't just about getting higher scores on outdated tests. Humanity's Last Exam is the next evolution in AI benchmarking, forcing models to prove their intelligence in ways that actually matter. If AI can start excelling at HLE, we'll know we're truly moving towards systems that don't just memorize information but actually understand and apply it - a major step toward more useful, reliable, and even trustworthy AI. Until then, expect AI companies to be laser-focused on improving reasoning, problem-solving, and multi-modal capabilities - because if they want to claim their models are the best, they'll have to pass Humanity's Last Exam first.
OpenAI's Deep Research achieves a record-breaking 26.6% accuracy on Humanity's Last Exam, a new benchmark designed to test the limits of AI reasoning and problem-solving abilities across diverse fields.
In a significant leap forward for artificial intelligence, OpenAI's Deep Research has achieved a groundbreaking score of 26.6% accuracy on Humanity's Last Exam (HLE), a newly established benchmark designed to push AI systems to their limits [1]. This result represents a 183% increase in accuracy over DeepSeek R1's 9.4%, the leaderboard leader less than two weeks earlier, setting a new standard for AI capabilities in complex reasoning and problem-solving.
HLE, developed by the Center for AI Safety (CAIS) and Scale AI, is considered the world's hardest AI exam. It comprises 3,000 challenging questions spanning over 100 subjects, including mathematics, physics, law, medicine, and philosophy [2]. Unlike previous benchmarks, HLE incorporates both text and image-based questions, with 10% of the exam requiring visual processing alongside written context.
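To make the scoring concrete, here is a minimal, hypothetical sketch of how accuracy on a mixed text and image question set like this could be computed. This is not the official HLE harness: the Question fields, the exact-match grading rule, and the answer_fn stub are illustrative assumptions, and a real harness would also need to handle the image inputs and the held-out questions that are never publicly released.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Question:
    prompt: str
    reference_answer: str
    has_image: bool = False  # roughly 10% of HLE's questions include an image

def evaluate(questions: list[Question], answer_fn: Callable[[Question], str]) -> dict:
    """Score a model on a mixed text/image question set using exact-match grading."""
    correct = sum(
        answer_fn(q).strip().lower() == q.reference_answer.strip().lower()
        for q in questions
    )
    multimodal = sum(q.has_image for q in questions)
    return {
        "accuracy_pct": 100 * correct / len(questions),
        "multimodal_share_pct": 100 * multimodal / len(questions),
    }

# Toy usage: two made-up questions and a stub "model" that always answers "4".
toy_set = [
    Question("2 + 2 = ?", "4"),
    Question("What shape is shown in the attached figure?", "a triangle", has_image=True),
]
print(evaluate(toy_set, answer_fn=lambda q: "4"))
# {'accuracy_pct': 50.0, 'multimodal_share_pct': 50.0}
```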
The AI community has witnessed remarkable progress in a short span of time. Just days before Deep Research's achievement, other models had set impressive benchmarks: DeepSeek R1 led the leaderboard with 9.4% accuracy on the text-only evaluation, while OpenAI's o3-mini scored 10.5% and o3-mini-high reached 13% [1].
It's worth noting that Deep Research's exceptional performance is partly attributed to its web search capabilities, which are not available to the other AI models. This feature provides an advantage in addressing the general knowledge questions included in the exam [1].
HLE represents a critical shift in how AI progress is measured and evaluated:
Exposing AI Weaknesses: The exam reveals areas where AI still struggles, such as deep reasoning and multi-modal understanding [2].
Setting New Standards: HLE challenges AI companies to focus on meaningful advancements rather than superficial improvements [2].
Increasing Accountability: The benchmark introduces transparency and forces AI models to perform under pressure, mimicking real-world scenarios [2].
While Deep Research's 26.6% accuracy on HLE is impressive, it still falls short of what would be considered a passing grade in human terms. This underscores the significant challenges that remain in developing AI systems capable of human-level reasoning across diverse fields [1].
As AI continues to evolve rapidly, HLE will likely play a crucial role in gauging progress and directing research efforts. The AI community now faces the exciting challenge of pushing beyond current limitations, with many wondering how long it will take for an AI model to surpass the 50% mark on this rigorous exam [1][2].
Scale AI and the Center for AI Safety have introduced a challenging new AI benchmark called 'Humanity's Last Exam', which has proven difficult for even the most advanced AI models, highlighting the current limitations of artificial intelligence.
7 Sources
A group of AI researchers is developing a comprehensive test called "Humanity's Last Exam" to assess the capabilities and limitations of advanced AI systems. This initiative aims to identify potential risks and ensure responsible AI development.
9 Sources
Researchers are developing a comprehensive test to measure AI capabilities, dubbed "Humanity's Last Exam." This collaborative effort aims to create benchmarks for assessing when AI reaches or surpasses human-level intelligence.
2 Sources
As AI models like OpenAI's o3 series surpass human-level performance on various benchmarks, including complex mathematical problems, the need for more sophisticated evaluation methods becomes apparent.
2 Sources
Epoch AI's FrontierMath, a new mathematics benchmark, reveals that leading AI models struggle with complex mathematical problems, solving less than 2% of the challenges.
8 Sources