OpenAI's Deep Research Dominates Humanity's Last Exam, Setting New Benchmarks in AI Capabilities

OpenAI's Deep Research achieves a record-breaking 26.6% accuracy on Humanity's Last Exam, a new benchmark designed to test the limits of AI reasoning and problem-solving abilities across diverse fields.

OpenAI's Deep Research Shatters Records on Humanity's Last Exam

OpenAI's Deep Research has scored 26.6% accuracy on Humanity's Last Exam (HLE), a newly established benchmark designed to push AI systems to their limits 1. That is a 183% increase over the 9.4% posted by DeepSeek R1 and roughly double the 13% reached by OpenAI's o3-mini-high, the best scores reported only days earlier, and it sets a new reference point for AI performance in complex reasoning and problem-solving.

Understanding Humanity's Last Exam

HLE, developed by the Center for AI Safety (CAIS) and Scale AI, is billed as the world's hardest AI exam. It comprises 3,000 challenging questions spanning more than 100 subjects, including mathematics, physics, law, medicine, and philosophy 2. Unlike previous benchmarks, HLE combines text and image-based questions, with 10% of the exam requiring visual processing alongside written context.

The Rapid Progress of AI Models

Scores on HLE have risen quickly. Just days before Deep Research's result, other models had posted the following scores:

  1. DeepSeek R1: 9.4% accuracy (text-only evaluation)
  2. OpenAI's o3-mini: 10.5% accuracy (standard setting)
  3. OpenAI's o3-mini-high: 13% accuracy (more intelligent but slower setting) 1
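
To put these scores next to Deep Research's 26.6%, the relative increase over each earlier result can be worked out directly. The short Python sketch below uses only the figures reported in this article; the function and variable names are illustrative and do not come from any benchmark tooling.

```python
# Accuracy scores (percent) as reported in the article.
DEEP_RESEARCH = 26.6
PREVIOUS_SCORES = {
    "DeepSeek R1": 9.4,
    "o3-mini": 10.5,
    "o3-mini-high": 13.0,
}


def relative_increase(new: float, old: float) -> float:
    """Return the percentage increase of `new` over `old`."""
    return (new - old) / old * 100.0


def compare_to_baselines(new: float, baselines: dict[str, float]) -> dict[str, int]:
    """Map each baseline name to the rounded relative increase of `new` over it."""
    return {name: round(relative_increase(new, old)) for name, old in baselines.items()}


if __name__ == "__main__":
    for name, pct in compare_to_baselines(DEEP_RESEARCH, PREVIOUS_SCORES).items():
        print(f"Deep Research vs {name}: +{pct}%")
```

On these numbers, the 183% figure corresponds to the comparison with DeepSeek R1's 9.4%; measured against o3-mini-high's 13%, the increase is closer to 105%.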

Deep Research's Distinctive Advantage

Deep Research's score is partly attributed to its web search capabilities, which the other models in this comparison lacked. Access to search provides an advantage on the general knowledge questions included in the exam 1.

The Significance of HLE in AI Development

HLE represents a critical shift in how AI progress is measured and evaluated:

  1. Exposing AI Weaknesses: The exam reveals areas where AI still struggles, such as deep reasoning and multi-modal understanding 2.

  2. Setting New Standards: HLE challenges AI companies to focus on meaningful advancements rather than superficial improvements 2.

  3. Increasing Accountability: The benchmark introduces transparency and forces AI models to perform under pressure, mimicking real-world scenarios 2.

The Road Ahead

While Deep Research's 26.6% accuracy on HLE is impressive, it still falls short of what would be considered a passing grade in human terms. This underscores the significant challenges that remain in developing AI systems capable of human-level reasoning across diverse fields 1.

As AI continues to evolve rapidly, HLE will likely play a crucial role in gauging progress and directing research efforts. The open question for the AI community is how long it will take for a model to surpass the 50% mark on this exam 1 2.
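
For a rough sense of scale, the percentages can be converted into question counts. The sketch below is a back-of-the-envelope calculation that assumes every score is computed over the full 3,000-question set, which the article does not confirm; the names are illustrative only.

```python
TOTAL_QUESTIONS = 3000  # exam size stated in the article


def questions_correct(accuracy_pct: float, total: int = TOTAL_QUESTIONS) -> int:
    """Approximate number of correct answers for a given accuracy.

    Assumption: accuracy is measured over all `total` questions.
    """
    return round(accuracy_pct / 100 * total)


if __name__ == "__main__":
    current = questions_correct(26.6)  # Deep Research's reported score
    halfway = questions_correct(50.0)  # the 50% mark
    print(f"At 26.6%: about {current} questions correct")
    print(f"At 50%:   about {halfway} questions correct")
    print(f"Gap:      about {halfway - current} questions")
```

Under that assumption, 26.6% is roughly 798 correct answers, leaving about 702 more to reach the 50% mark.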

TheOutpost.ai
© 2025 Triveous Technologies Private Limited