New AI Benchmark 'Humanity's Last Exam' Stumps Top Models, Revealing Limits of Current AI

Scale AI and the Center for AI Safety have introduced a challenging new AI benchmark called 'Humanity's Last Exam', which has proven difficult for even the most advanced AI models, highlighting the current limitations of artificial intelligence.

New Benchmark Challenges Top AI Models

Scale AI and the Center for AI Safety (CAIS) have introduced a new AI benchmark called "Humanity's Last Exam" (HLE), designed to test the limits of AI knowledge at the frontiers of human expertise [1][2]. The benchmark aims to address "benchmark saturation," in which AI models score so highly on standard tests that those tests no longer give an accurate gauge of their capabilities [3].

Comprehensive and Challenging Test Design

The HLE consists of 3,000 questions covering over 100 subjects in mathematics, science, and the humanities [1]. These questions were selected from an initial pool of 70,000, with contributions from nearly 1,000 subject-matter experts across 500 institutions in 50 countries [2][4]. The benchmark includes multiple-choice and short-answer questions, as well as multi-modal questions that combine text with diagrams and images [4].

Performance of Top AI Models

In initial testing, current AI models struggled significantly with the HLE:

  • OpenAI's GPT-4o: 3.3% accuracy
  • Anthropic's Claude 3.5 Sonnet: 4.3% accuracy
  • Google's Gemini 1.5 Pro: 6.2% accuracy
  • OpenAI's o1: 9.1% accuracy
  • DeepSeek-R1: 9.4% accuracy [3][5]

These results stand in stark contrast to the high scores (often over 90%) that many of these models achieve on other popular benchmarks like MMLU, MATH, and GPQA [1][2].

Implications for AI Development

The poor performance of top AI models on the HLE reveals significant gaps in AI capabilities when it comes to expert-level knowledge and complex reasoning [2]. Dan Hendrycks, co-founder and executive director of CAIS, noted that while it is uncertain how quickly models will advance, the HLE currently demonstrates that there are still expert-level questions AI models cannot answer [1][2].

Future Outlook

While the current results show a clear limitation in AI capabilities, researchers are cautious about making long-term predictions. Given the rapid pace of AI advancement, the benchmark's creators consider it plausible that models could exceed 50% accuracy on the HLE by the end of 2025 [2]. However, they emphasize that such an achievement would not necessarily indicate autonomous research capabilities or artificial general intelligence [2].

Ongoing Research and Accessibility

CAIS and Scale AI plan to release the HLE dataset to researchers for further study of AI systems and their limitations [1]. The benchmark remains open to new question submissions, though cash prizes are no longer being awarded [1]. The initiative is a step toward more challenging and comprehensive evaluations of AI capabilities as the field continues to evolve rapidly.
