2 Sources
[1]
Acing this new AI exam -- which its creators say is the toughest in the world -- might point to the first signs of AGI
Humanity's Last Exam is a PhD-level benchmark designed to test the limits of AI reasoning. Although Google's Gemini 3 scored a staggering 48.4%, experts stress that this does not indicate the arrival of artificial general intelligence (AGI).

Researchers at the Center for AI Safety and Scale AI have published "Humanity's Last Exam" -- a test designed to measure how close today's most powerful artificial intelligence (AI) models are to meeting or exceeding human-level knowledge across several domains. The test was launched in January 2025, but scientists outlined the framework and their thinking behind its design for the first time in a new study published Jan. 28 in the journal Nature.

It contains a corpus of 2,500 questions across more than 100 subjects, with input from more than 1,000 subject-matter experts from 500 institutions across 50 countries. The exam consists of multiple-choice and short-answer questions, each of which has a known solution that is "unambiguous and easily verifiable but cannot be quickly answered by internet retrieval."

At launch, the researchers tested OpenAI's GPT-4o and o1 models, Google's Gemini 1.5 Pro, Anthropic's Claude 3.5 Sonnet and DeepSeek R1. OpenAI's o1 system notched the top spot with a score of just 8.3%. Despite this poor performance, the researchers wrote at the time that "given the rapid pace of AI development, it is plausible that models could exceed 50% accuracy on HLE by the end of 2025." As of Feb. 12, 2026, the highest score achieved so far is 48.4%, set by Google's Gemini 3 Deep Think. Human experts, meanwhile, score around 90% in their respective domains.

Testing the smartest machines in the world

Humanity's Last Exam was intentionally designed to be extremely difficult for AI models. During early development, the researchers put out a global call for submissions from subject-matter experts across numerous domains.
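Because every question is required to have a known solution that is "unambiguous and easily verifiable," scoring a model on such a benchmark can be reduced to normalized exact matching against an answer key. Below is a minimal sketch of that kind of grader; the function names and the toy data are illustrative only, not from the HLE codebase:

```python
# Sketch of exact-match grading for a benchmark whose answers are short,
# unambiguous and verifiable. All names here are illustrative.

def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial formatting
    differences don't affect the score."""
    return " ".join(answer.strip().lower().split())

def grade(predictions: dict[str, str], answer_key: dict[str, str]) -> float:
    """Return accuracy: the fraction of questions answered exactly right."""
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(gold)
        for qid, gold in answer_key.items()
    )
    return correct / len(answer_key)

# Toy usage with made-up questions and answers:
key = {"q1": "example answer", "q2": "42"}
preds = {"q1": "  Example   Answer ", "q2": "41"}
print(grade(preds, key))  # 0.5
```

This kind of deterministic grading is what lets a 2,500-question exam report a single headline accuracy figure, such as the 48.4% cited above, without human judges in the loop.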
The researchers enforced strict submission criteria requiring questions to be precise, unambiguous, solvable and non-searchable. They didn't want models to cheat by performing a simple web search, or for any of the questions to already appear online -- which would increase the likelihood that a given model had the answer in its training dataset. Each submitted question was then fed to the AI models, and the team automatically rejected any the models could answer correctly. Of the more than 70,000 questions submitted, approximately 13,000 stumped the LLMs. These were then vetted by a team of subject-matter experts, approved by the research team, and presented to the scientific community for open feedback. Ultimately, the researchers narrowed the pool down to 2,500 questions that generally fall within the realm of PhD-level testing.

An example of a trivia question in the exam is: "In Greek mythology, who was Jason's maternal great-grandfather?" Meanwhile, an example of a physics question asks for the relationship between different forces during motion in a scenario where a block is placed on a horizontal rail (on which it can slide frictionlessly) while also being attached to a rigid, massless rod of unknown length.

The breadth of questions and scope of subjects covered by Humanity's Last Exam set it apart from similar benchmarking tools, its creators say. Common tests, such as the Massive Multitask Language Understanding (MMLU) dataset, which was authored with participation from Center for AI Safety founder Dan Hendrycks, test only a small subset of expert-level domain knowledge, primarily focusing on coding and mathematics. Even state-of-the-art benchmarks such as Francois Chollet's ARC-AGI suite struggle to escape the memorization and searchability problems that the creators of Humanity's Last Exam say the new test addresses.
Gemini's Deep Think, for example, achieved 84.6% on the ARC-AGI-2 benchmark just a week after failing to reach 50% on the HLE test.

The ultimate prize is general intelligence

Humanity's Last Exam likely represents the AI world's best attempt to date at measuring the broad-spectrum capabilities of modern AI models relative to human experts, but the study's authors categorically state that a high score on the HLE is in no way indicative of the arrival of artificial general intelligence (AGI). "High accuracy on HLE would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence," the scientists said in the study.

"Doing well on HLE is a necessary, but not a sufficient, criterion to say that machines have reached true intelligence," Manuel Schottdorf, a neuroscientist at the University of Delaware's Department of Psychological and Brain Sciences, said in a recent statement. Schottdorf is one of the many experts whose questions were accepted into the HLE corpus. "They will have to be good enough to solve these questions, but that as a fact alone can't allow us to conclude that machines are truly intelligent."
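The adversarial filtering step described above -- feed each candidate question to frontier models and automatically reject any the models answer correctly -- can be sketched as a simple pipeline. In this sketch the "models" are just callables returning canned answers; nothing here reflects HLE's actual infrastructure:

```python
# Sketch of adversarial question filtering: a candidate question survives
# only if every reference model fails it. Models are stand-in callables.
from typing import Callable

def survives(question: str, gold: str,
             models: list[Callable[[str], str]]) -> bool:
    """A candidate survives only if no model produces the gold answer."""
    return all(m(question).strip().lower() != gold.strip().lower()
               for m in models)

def filter_submissions(submissions, models):
    """Keep only (question, gold) pairs that stump every model. Survivors
    would then go on to expert review, as in the HLE pipeline."""
    return [(q, a) for q, a in submissions if survives(q, a, models)]

# Toy usage: two fake "models" implemented as dicts of canned answers.
model_a = lambda q: {"easy": "4"}.get(q, "I don't know")
model_b = lambda q: {"easy": "4", "medium": "7"}.get(q, "I don't know")
pool = [("easy", "4"), ("medium", "7"), ("hard", "novel result")]
print(filter_submissions(pool, [model_a, model_b]))
# [('hard', 'novel result')]
```

The design choice matters: because questions any current model can answer are discarded up front, the surviving set measures exactly the frontier of model failure -- which is also why scores on the exam start near zero and climb as models improve.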
[2]
Humanity's Last Exam pushed AI to its limits - but did it pass?
A few years ago, the big question in AI was whether a system could pass the kinds of exams humans struggle with. Now, the problem has flipped. Some artificial intelligence systems started cruising through well-known academic benchmarks, and that success created an awkward reality: the tests weren't telling us much anymore.

Take the Massive Multitask Language Understanding exam, better known as MMLU. It used to look tough. Then newer models began racking up strong scores, and researchers realized that "passing" didn't necessarily mean the systems truly understood what they were doing.

A group of researchers decided to stop tweaking old tests and build a new one from scratch. Nearly 1,000 experts from around the world joined forces to create a massive assessment designed to sit just past what today's AI can handle. They called it "Humanity's Last Exam" (HLE). It's a 2,500-question challenge that ranges across mathematics, the humanities, natural sciences, ancient languages, and narrow specialties most people never bump into outside a lab or a library. The work was described in a paper, with project documentation available at lastexam.ai.

One contributor who helped shape the test was Dr. Tung Nguyen, an instructional associate professor in the Department of Computer Science and Engineering at Texas A&M University. Dr. Nguyen helped author and refine questions, including a big chunk of what's publicly available.

The questions themselves aren't the usual multiple-choice trivia. They draw on the kind of knowledge that takes years to build, and they're written to be cleanly graded. Experts wrote and reviewed each one to make sure it had a single, unambiguous, verifiable answer, and that it couldn't be solved instantly by grabbing something off the internet.
Some prompts reach into topics like translating ancient Palmyrene inscriptions, spotting microanatomical structures in birds, or parsing the fine points of Biblical Hebrew pronunciation.

Here's the part that makes HLE feel different from older benchmarks: the creators tested questions against leading AI models as they built the exam. If a system could answer a question correctly, that question didn't make the cut. The goal wasn't to be cute or cruel. It was to pinpoint where current AI still falls short, in a way that can be measured.

"When AI systems start performing extremely well on human benchmarks, it's tempting to think they're approaching human-level understanding," said Dr. Nguyen. "But HLE reminds us that intelligence isn't just about pattern recognition -- it's about depth, context and specialized expertise."

Once the exam took shape, the results landed with a thud. Early testing showed that even high-profile systems struggled to get traction. GPT-4o scored 2.7%; Claude 3.5 Sonnet reached 4.1%; OpenAI's flagship o1 model achieved only 8%. More recent top-end systems have done better, but not in a way that makes the gap disappear: Gemini 3.1 Pro and Claude Opus 4.6 have reached around 40% to 50% accuracy. Those numbers are the point. HLE isn't trying to flatter AI. It's trying to keep the yardstick honest.

Dr. Nguyen's involvement also shows how much hands-on work goes into a benchmark like this. He contributed 73 of the 2,500 public questions -- the second-highest count of any author -- and he wrote the most questions in math and computer science. "Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do," he said. "Benchmarks provide the foundation for measuring progress and identifying risks."

The concern is simple: when a test gets too familiar, it stops measuring what people think it measures.
A system can look impressive on human-designed exams without matching human understanding in the messy, context-heavy way real life demands.

The title sounds dramatic, but the purpose is practical. The exam is meant to map strengths and weak spots so developers and decision-makers don't get swept away by hype. "This isn't a race against AI," said Dr. Nguyen. "It's a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters."

A big challenge with any AI test is that models learn fast, and they also memorize fast. The team behind Humanity's Last Exam tried to plan for that by making only part of the exam public while keeping most questions hidden, so future systems can't simply train on the full answer set. "For now, Humanity's Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence, and despite rapid technological advances, it remains wide," said Dr. Nguyen.

One of the most interesting parts of HLE isn't the name or the scoreboard. It's the way it was made: experts across fields working together to write questions that reflect real specialist knowledge, not just clever wordplay or test-taking tricks. "What made this project extraordinary was the scale," said Dr. Nguyen. "Experts from nearly every discipline contributed. It wasn't just computer scientists; it was historians, physicists, linguists, medical researchers. That diversity is exactly what exposes the gaps in today's AI systems -- perhaps ironically, it's humans working together."
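The public/private split described above -- release part of the question set for open evaluation while holding the rest back so future models can't train on the full answer key -- is a common defense against benchmark contamination. A minimal sketch of how such a split might be maintained (a reproducible seeded shuffle; nothing here reflects HLE's actual tooling):

```python
# Sketch of a contamination defense: split a question pool into a public
# set (released) and a private held-out set (kept secret for scoring).
import random

def split_exam(questions: list[dict], public_fraction: float,
               seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Deterministically split a question pool. The fixed seed makes the
    split reproducible, so the same questions stay private over time."""
    rng = random.Random(seed)
    shuffled = questions[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * public_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy usage with placeholder questions:
pool = [{"id": i} for i in range(10)]
public, private = split_exam(pool, public_fraction=0.3)
print(len(public), len(private))  # 3 7
```

Scoring models on the private half then gives a result that cannot have been inflated by memorizing leaked answers, which is what keeps the yardstick honest as training corpora grow.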
Researchers created Humanity's Last Exam, a PhD-level AI benchmark with 2,500 questions designed to assess the limits of AI reasoning. Google's Gemini 3 Deep Think achieved the highest score at 48.4%, while human experts score around 90%. Despite progress, experts emphasize this doesn't signal the arrival of Artificial General Intelligence.
Researchers at the Center for AI Safety and Scale AI have launched Humanity's Last Exam, a highly challenging PhD-level benchmark designed to measure AI capabilities against human expertise across more than 100 subjects [1]. The test contains 2,500 questions developed with input from more than 1,000 subject-matter experts from 500 institutions across 50 countries, representing one of the most comprehensive efforts to evaluate AI progress [1]. Launched in January 2025 and later described in a study in the journal Nature, this AI benchmark addresses a critical problem: existing tests like MMLU had become too easy for modern systems, making it difficult to accurately measure where leading AI models truly stand [2].
Source: Earth.com
When Humanity's Last Exam launched in January 2025, the results were sobering. OpenAI's GPT-4o scored just 2.7%, while Claude 3.5 Sonnet from Anthropic reached only 4.1% [2]. Even OpenAI's flagship o1 model achieved merely 8.3%, despite being among the most advanced systems available [1]. By February 2026, Google's Gemini 3 Deep Think had set the current record at 48.4%, while human experts consistently score around 90% in their respective domains [1]. More recent systems, including Gemini 3.1 Pro and Claude Opus 4.6, have reached approximately 40% to 50% accuracy, showing progress but still revealing a substantial gap between AI and human intelligence [2].
Source: Live Science
The exam's creators implemented strict criteria requiring questions to be precise, unambiguous, solvable and, critically, non-searchable [1]. Researchers didn't want models to cheat through simple web searches or to encounter questions already present in their training data. During development, each question was tested against AI models, and any that the systems could answer correctly were automatically rejected [1]. From more than 70,000 submissions, approximately 13,000 stumped the large language models; these were then vetted by subject-matter experts before being narrowed to the final 2,500 questions [1]. Dr. Tung Nguyen from Texas A&M University contributed 73 questions, the second-highest author count, focusing on mathematics and computer science [2].
Despite Gemini 3 Deep Think's 48.4% score representing significant progress, the study's authors categorically state that high accuracy on Humanity's Last Exam does not indicate the arrival of artificial general intelligence [1]. "High accuracy on HLE would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence," the researchers stated. Manuel Schottdorf, a neuroscientist at the University of Delaware who contributed questions, emphasized that "doing well on HLE is a necessary, but not a sufficient criterion to say that machines have reached true intelligence" [1]. The evaluation of AI progress requires understanding that intelligence involves depth, context and specialized expertise beyond pattern recognition [2].

"Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do," Dr. Nguyen explained [2]. The benchmark provides a foundation for measuring progress and identifying risks, helping stakeholders avoid getting swept away by hype. When systems perform well on familiar tests like MMLU, created with participation from Center for AI Safety founder Dan Hendrycks, it becomes tempting to assume they approach human-level understanding [1][2]. However, Gemini's Deep Think achieved 84.6% on the ARC-AGI-2 benchmark just a week after failing to reach 50% on Humanity's Last Exam, demonstrating how different tests measure different capabilities [1]. The team made only part of the exam public while keeping most questions hidden, preventing future systems from simply training on the full answer set and maintaining the test's integrity for ongoing AI reasoning assessment [2].