Curated by THEOUTPOST
On Tue, 4 Feb, 4:03 PM UTC
2 Sources
[1]
OpenAI's Deep Research smashes records for the world's hardest AI exam, with ChatGPT o3-mini and DeepSeek left in its wake
The world's hardest AI exam, Humanity's Last Exam, was launched less than two weeks ago, and we've already seen a huge jump in accuracy, with ChatGPT o3-mini and now OpenAI's Deep Research topping the leaderboard. The AI benchmark, created by experts from around the world, contains some of the hardest reasoning problems and questions known to man - it's so hard that when I previously wrote about Humanity's Last Exam in the article linked above, I couldn't even understand one of the questions, let alone answer it.

At the time of writing that last article, world phenomenon DeepSeek R1 sat at the top of the leaderboard with a 9.4% accuracy score when evaluated on text only (not multi-modal). Now, OpenAI's o3-mini, which launched earlier this week, has scored 10.5% accuracy at the standard o3-mini setting and 13% at the o3-mini-high setting, which is more capable but takes longer to generate answers. More impressive, however, is OpenAI's new AI agent Deep Research, which scored 26.6% on the benchmark - a whopping 183% increase in accuracy over the previous leader in less than 10 days.

It's worth noting that Deep Research has web search capabilities that the other AI models lack, which makes direct comparisons slightly unfair. The ability to search the web is helpful for a test like Humanity's Last Exam, which includes some general knowledge questions. That said, models' accuracy on Humanity's Last Exam is steadily improving, and it does make you wonder how long we'll need to wait before an AI model comes close to mastering the benchmark. Realistically, AI shouldn't be able to come close any time soon, but I wouldn't bet against it.

OpenAI's Deep Research is an incredibly impressive tool, and I've been blown away by the examples OpenAI showed off when it announced the AI agent. Deep Research can work as your personal analyst, taking time to conduct intensive research and produce reports and answers that would otherwise take humans hours and hours to complete.

While a score of 26.6% on Humanity's Last Exam is seriously impressive, especially considering how far the benchmark's leaderboard has come in just a couple of weeks, it's still a low score in absolute terms - no one would claim to have passed a test with anything less than 50% in the real world. Humanity's Last Exam is an excellent benchmark, and one that will prove invaluable as AI models develop, enabling us to gauge just how far they've come. How long will we have to wait to see an AI model pass the 50% mark? And which model will be the first to do so?
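If you want to sanity-check that 183% figure, it's simple arithmetic. Here's a minimal sketch in Python, assuming the baseline is DeepSeek R1's earlier 9.4% text-only score and the new result is Deep Research's 26.6%, both taken from the article above:

```python
# Back-of-the-envelope check of the reported "183% increase" claim.
# Assumption: the baseline is DeepSeek R1's 9.4% text-only score and the
# new result is Deep Research's 26.6%, as quoted in the article.
previous_best = 9.4   # DeepSeek R1 accuracy (%), text-only evaluation
deep_research = 26.6  # OpenAI Deep Research accuracy (%)

relative_increase = (deep_research - previous_best) / previous_best * 100
print(f"Relative increase: {relative_increase:.0f}%")  # prints ~183%
```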
[2]
Humanity's Last Exam Explained - The ultimate AI benchmark that sets the tone of our AI future
Artificial intelligence (AI) has been evolving at breakneck speed, with models like OpenAI's GPT-4 and DeepSeek's R1 pushing the boundaries of what machines can do. We are in an era where AI systems can write poetry, diagnose diseases, and even drive cars. Now a new benchmark has emerged with the promise of redefining humanity's relationship with technology. Dubbed "Humanity's Last Exam" (HLE), this ambitious evaluation is being hailed as the definitive test to determine whether AI can match - or surpass - human-level reasoning, creativity, and ethical judgment. But what exactly is Humanity's Last Exam? Why are experts calling it humanity's "final test"? And should we be excited or concerned about its implications? Let's break it down.

The problem with existing AI benchmarks is pretty simple: models are acing them too easily. Take the Massive Multitask Language Understanding (MMLU) benchmark, for example. It was once the gold standard for evaluating AI general knowledge, but today's top AI models are hitting 90%+ accuracy on it. That sounds impressive - until you realize that many of these tests weren't designed to handle the reasoning, creativity, or multi-modal capabilities (text + image processing) that cutting-edge AI systems are starting to develop. HLE was created specifically to push AI models to their limits. Developed by the Center for AI Safety (CAIS) and Scale AI, it introduces a tougher, more comprehensive challenge that better reflects real-world intelligence.

Humanity's Last Exam is built to simulate real expert-level problem-solving rather than just regurgitating memorized facts. One of its defining features is its massive scale. The exam consists of 3,000 highly challenging questions that span more than 100 subjects, ranging from mathematics and physics to law, medicine, and philosophy. Unlike many previous AI benchmarks, which were primarily designed by researchers, Humanity's Last Exam's questions were crowdsourced from a global network of nearly 1,000 experts across 500+ institutions in 50 countries. This diversity ensures that the test reflects a broad spectrum of knowledge domains and problem-solving approaches.

Another major distinction is its multi-modal challenge. While most AI benchmarks focus purely on text-based reasoning, HLE incorporates a mix of text and image-based questions, with 10% of the exam requiring AI systems to process visual information alongside written context. This added layer of complexity makes it much harder for AI models to succeed using simple pattern recognition alone. Instead, they must demonstrate the ability to integrate different types of information - something that remains a major challenge for even the most advanced AI systems today.

To further prevent AI from "gaming" the test, some of the toughest questions in HLE are kept hidden from public datasets. This is a critical improvement over older benchmarks, where AI companies could simply train their models on the test questions to artificially boost their scores. By introducing a level of secrecy, HLE ensures that models must exhibit genuine problem-solving ability rather than just memorization.

So how are AI models performing on HLE so far? Not great. Even the best AI models today are struggling, with most scoring in the single digits or low double digits.
Here's how some notable models have fared: DeepSeek R1 scored 9.4% accuracy on the text-only evaluation, OpenAI's o3-mini scored 10.5%, o3-mini-high reached 13%, and OpenAI's new Deep Research agent currently leads with 26.6%. Compare this to older benchmarks like MMLU, where top AI models regularly exceed 90% accuracy, and you can see just how much harder HLE is. This tells us that while AI models may look impressive on older tests, they're still far from mastering complex reasoning and real-world problem-solving. The fact that no model has come close to human-level performance on HLE suggests that we still have a long way to go before AI reaches true expert-level proficiency.

HLE isn't just a tougher exam - it's a reality check for AI development. As AI systems keep improving, benchmarks like this will be essential in separating hype from actual progress. One of the biggest takeaways from HLE is that it exposes AI weaknesses that still need to be addressed. Today's models struggle with deep reasoning, multi-modal understanding, and tackling entirely new types of problems. Rather than just showing what AI can do well, HLE provides clear evidence of where it still falls short. This kind of insight is invaluable for researchers and developers looking to build more capable AI systems.

Beyond identifying weaknesses, HLE also helps set a new standard for AI development. AI companies will no longer be able to claim groundbreaking progress based solely on outdated benchmarks. Instead, they'll have to prove that their models can handle the kinds of real-world challenges that actually matter. This could lead to more meaningful advancements in AI, with models that are better equipped to assist in high-stakes fields like science, medicine, and law.

Perhaps most importantly, HLE introduces a new level of accountability into AI development. There has been growing concern that AI companies are prioritizing flashy but superficial improvements over real progress. By creating a much tougher, more transparent benchmark, HLE forces companies to build AI models that actually perform well under pressure, rather than just looking good in controlled settings.

AI is advancing faster than ever, but progress isn't just about getting higher scores on outdated tests. Humanity's Last Exam is the next evolution in AI benchmarking, forcing models to prove their intelligence in ways that actually matter. If AI can start excelling at HLE, we'll know we're truly moving towards systems that don't just memorize information but actually understand and apply it - a major step toward more useful, reliable, and even trustworthy AI. Until then, expect AI companies to be laser-focused on improving reasoning, problem-solving, and multi-modal capabilities - because if they want to claim their models are the best, they'll have to pass Humanity's Last Exam first.
OpenAI's Deep Research achieves a record-breaking 26.6% accuracy on Humanity's Last Exam, a new benchmark designed to test the limits of AI reasoning and problem-solving abilities across diverse fields.
In a significant leap forward for artificial intelligence, OpenAI's Deep Research has achieved a groundbreaking score of 26.6% accuracy on Humanity's Last Exam (HLE), a newly established benchmark designed to push AI systems to their limits [1]. This result represents a 183% increase in accuracy over DeepSeek R1's 9.4%, the leaderboard leader less than two weeks earlier, setting a new standard for AI capabilities in complex reasoning and problem-solving.
HLE, developed by the Center for AI Safety (CAIS) and Scale AI, is considered the world's hardest AI exam. It comprises 3,000 challenging questions spanning over 100 subjects, including mathematics, physics, law, medicine, and philosophy [2]. Unlike previous benchmarks, HLE incorporates both text and image-based questions, with 10% of the exam requiring visual processing alongside written context.
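To make the scoring concrete, here is a minimal, hypothetical sketch of how accuracy on a mixed text and image question set like this could be computed. This is not the official HLE harness: the Question fields, the exact-match grading rule, and the answer_fn stub are illustrative assumptions, and a real harness would also need to handle the image inputs and the held-out questions that are never publicly released.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Question:
    prompt: str
    reference_answer: str
    has_image: bool = False  # roughly 10% of HLE's questions include an image

def evaluate(questions: list[Question], answer_fn: Callable[[Question], str]) -> dict:
    """Score a model on a mixed text/image question set using exact-match grading."""
    correct = sum(
        answer_fn(q).strip().lower() == q.reference_answer.strip().lower()
        for q in questions
    )
    multimodal = sum(q.has_image for q in questions)
    return {
        "accuracy_pct": 100 * correct / len(questions),
        "multimodal_share_pct": 100 * multimodal / len(questions),
    }

# Toy usage: two made-up questions and a stub "model" that always answers "4".
toy_set = [
    Question("2 + 2 = ?", "4"),
    Question("What shape is shown in the attached figure?", "a triangle", has_image=True),
]
print(evaluate(toy_set, answer_fn=lambda q: "4"))
# {'accuracy_pct': 50.0, 'multimodal_share_pct': 50.0}
```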
The AI community has witnessed remarkable progress in a short span of time. Just days before Deep Research's achievement, other models had set impressive benchmarks: DeepSeek R1 led the leaderboard with 9.4% accuracy on the text-only evaluation, while OpenAI's o3-mini scored 10.5% and o3-mini-high reached 13% [1].
It's worth noting that Deep Research's exceptional performance is partly attributed to its web search capabilities, which are not available to the other AI models. This feature provides an advantage in addressing the general knowledge questions included in the exam [1].
HLE represents a critical shift in how AI progress is measured and evaluated:
Exposing AI Weaknesses: The exam reveals areas where AI still struggles, such as deep reasoning and multi-modal understanding [2].
Setting New Standards: HLE challenges AI companies to focus on meaningful advancements rather than superficial improvements [2].
Increasing Accountability: The benchmark introduces transparency and forces AI models to perform under pressure, mimicking real-world scenarios [2].
While Deep Research's 26.6% accuracy on HLE is impressive, it still falls short of what would be considered a passing grade in human terms. This underscores the significant challenges that remain in developing AI systems capable of human-level reasoning across diverse fields [1].
As AI continues to evolve rapidly, HLE will likely play a crucial role in gauging progress and directing research efforts. The AI community now faces the exciting challenge of pushing beyond current limitations, with many wondering how long it will take for an AI model to surpass the 50% mark on this rigorous exam [1][2].
Scale AI and the Center for AI Safety have introduced a challenging new AI benchmark called 'Humanity's Last Exam', which has proven difficult for even the most advanced AI models, highlighting the current limitations of artificial intelligence.
7 Sources
A group of AI researchers is developing a comprehensive test called "Humanity's Last Exam" to assess the capabilities and limitations of advanced AI systems. This initiative aims to identify potential risks and ensure responsible AI development.
9 Sources
Researchers are developing a comprehensive test to measure AI capabilities, dubbed "Humanity's Last Exam." This collaborative effort aims to create benchmarks for assessing when AI reaches or surpasses human-level intelligence.
2 Sources
As AI models like OpenAI's o3 series surpass human-level performance on various benchmarks, including complex mathematical problems, the need for more sophisticated evaluation methods becomes apparent.
2 Sources
Epoch AI's FrontierMath, a new mathematics benchmark, reveals that leading AI models struggle with complex mathematical problems, solving less than 2% of the challenges.
8 Sources