Curated by THEOUTPOST
On Fri, 24 Jan, 12:02 AM UTC
7 Sources
[1]
'Humanity's Last Exam' benchmark is stumping top AI models - can you do any better?
A new academic benchmark aims to 'test the limits of AI knowledge at the frontiers of human expertise.' So far, these LLMs are stumped. Are artificial intelligence (AI) models really surpassing human ability? Or are current tests just too easy for them? On Thursday, Scale AI and the Center for AI Safety (CAIS) released Humanity's Last Exam (HLE), a new academic benchmark aiming to "test the limits of AI knowledge at the frontiers of human expertise," Scale AI said in a release. The test consists of 3,000 text and multi-modal questions on more than 100 subjects, including math, science, and the humanities, submitted by experts in a variety of fields.

Anthropic's Michael Gerstenhaber, head of API technologies, noted to Bloomberg last fall that AI models frequently outpace benchmarks (part of why the Chatbot Arena leaderboard changes so rapidly when new models are released). For example, many LLMs now score over 90% on Massive Multitask Language Understanding (MMLU), a commonly used benchmark. This is known as benchmark saturation. By contrast, Scale reported that current models answered fewer than 10 percent of the HLE benchmark's questions correctly.

Researchers from the two organizations initially collected over 70,000 questions for HLE, narrowed them to 13,000 that were reviewed by human experts, and then distilled those once more into the final 3,000. They tested the questions on top models like OpenAI's o1 and GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro, comparing results against the MMLU, MATH, and GPQA benchmarks.

"When I released the MATH benchmark -- a challenging competition mathematics dataset -- in 2021, the best model scored less than 10%; few predicted that scores higher than 90% would be achieved just three years later," said Dan Hendrycks, CAIS co-founder and executive director. "Right now, Humanity's Last Exam shows that there are still some expert closed-ended questions that models are not able to answer. We will see how long that lasts."

Scale and CAIS gave contributors cash prizes for the top questions: $5,000 went to each of the top 50, while the next best 500 received $500 each. Although the final questions are now public, the two organizations kept another set of questions private to address "model overfitting," which occurs when a model is trained so closely to a dataset that it cannot make accurate predictions on new data. The benchmark's creators note that they are still accepting test questions but will no longer award cash prizes, though contributors are eligible for co-authorship. CAIS and Scale AI plan to release the dataset to researchers so that they can further study new AI systems and their limitations. You can view all benchmark and sample questions at lastexam.ai.
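The private holdout set is meant to catch exactly this kind of overfitting: a model that has effectively memorized the public questions should score noticeably worse on questions it has never seen. The short Python sketch below illustrates that comparison; the question dictionaries and the answer_fn callable are assumptions made for illustration, not Scale AI's actual evaluation pipeline.

def accuracy(answer_fn, questions):
    # Each question is assumed to be a dict with "prompt" and "answer" keys, and
    # answer_fn is a hypothetical callable mapping a prompt string to an answer string.
    correct = sum(
        1 for q in questions
        if answer_fn(q["prompt"]).strip().lower() == q["answer"].strip().lower()
    )
    return correct / len(questions)

def overfitting_gap(answer_fn, public_questions, private_questions):
    # A large positive gap (much higher accuracy on the public questions than on
    # the private holdout) is one rough signal that a model has overfit to, or
    # been trained on, the published set.
    return accuracy(answer_fn, public_questions) - accuracy(answer_fn, private_questions)

In practice HLE also contains multiple-choice and multi-modal items, so exact string matching is only a stand-in for the real grading.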
[2]
A new AI benchmark called 'Humanity's Last Exam' stumped top models -- for now, at least
Despite facing increasingly harder tests, artificial intelligence models have been advancing quickly and passing even PhD-level exams with high scores, making it somewhat difficult to track just how good they're getting. But it seems the AI models have met their match -- at least for now. According to the results of a new benchmark called "Humanity's Last Exam" (HLE), top AI models from OpenAI, Google, and Anthropic aren't quite "at the frontier of human knowledge" yet.

The evaluation, developed by researchers at the Center for AI Safety (CAIS) and Scale AI, is "designed to be the final closed-ended academic benchmark of its kind with broad subject coverage." Basically, they claim it's the most difficult test these AI models have ever faced. The researchers evaluated several multimodal frontier models, including Anthropic's Claude 3.5 Sonnet, Google's Gemini 1.5 Pro, and both OpenAI's GPT-4o and its new reasoning model, o1. All of the models scored less than 10% on HLE -- much lower than on popular benchmarks such as Massive Multitask Language Understanding (MMLU) and the graduate-level Google-Proof Q&A (GPQA).

"We wanted problems that would test the capabilities of the models at the frontier of human knowledge and reasoning," Dan Hendrycks, co-founder and executive director of CAIS, said in a statement. "We can't predict how quickly the models will advance... Right now, Humanity's Last Exam shows that there are still some expert closed-ended questions that models are not able to answer. We will see how long that lasts." Hendrycks is also an advisor to Scale, which he worked with to compile the more than 3,000 multiple-choice and short-answer questions across more than 100 subjects. Questions included one asking for a translation of a text in Palmyrene script and another about hummingbird anatomy.

The researchers received exam questions from close to "1,000 subject expert contributors" from around the world, who were asked to submit the "toughest questions" they know. Prize money was offered for the top questions, while contributors whose questions were chosen were offered optional co-authorship.

While the current top AI models failed the HLE, "recent history shows benchmarks are quickly saturated -- with models dramatically progressing from near-zero to near-perfect performance in a short timeframe," the researchers said. It's "plausible" that AI models could reach higher than 50% accuracy on the HLE by the end of the year, the researchers said. However, that alone wouldn't "suggest autonomous research capabilities or 'artificial general intelligence,'" they added, referring to the point at which AI systems are believed to have reached or exceeded human-level capabilities. "HLE may be the last academic exam we need to give to models, but it is far from the last benchmark for AI," the researchers said.
[3]
Could you pass 'Humanity's Last Exam'? Probably not, but neither can AI
Did you know some of the smartest people on the planet create benchmarks to test AI's capabilities at replicating human intelligence? Well, scarily enough, most AI benchmarks are easily completed by artificial intelligence models, showcasing just how smart the likes of OpenAI's GPT-4o, Google's Gemini 1.5, and even the new o3-mini really are. In the quest to create the hardest benchmark possible, Scale AI and the Center for AI Safety (CAIS) have teamed up to create Humanity's Last Exam, a test they're calling a "groundbreaking new AI benchmark that was designed to test the limits of AI knowledge at the frontiers of human expertise."

I'm not a genius by any means, but I had a glance at some of these questions, and let me tell you, they're ridiculously tough -- so much so that only the brightest minds on the planet could probably answer them. This incredible degree of difficulty means that, in testing, current AI models were able to answer fewer than 10 percent of the questions correctly. The original name for the test was 'Humanity's Last Stand', but that was changed to 'Exam' to take away the slightly terrifying nature of the concept. The questions were crowdsourced, with expert contributors from over 500 institutions across 50 countries coming up with the hardest reasoning questions possible. The current Humanity's Last Exam dataset consists of 3,000 questions.

According to the initial results reported by CAIS and Scale AI, OpenAI's GPT-4o achieved 3.3% accuracy on Humanity's Last Exam, while Grok-2 achieved 3.8%, Claude 3.5 Sonnet 4.3%, Gemini 6.2%, o1 9.1%, and DeepSeek-R1 (evaluated on text only, as it is not multi-modal) 9.4%. Interestingly, Humanity's Last Exam is substantially harder for AI than any other benchmark out there, including the most popular options: GPQA, MATH, and MMLU.

So what does this all mean? Well, we're still in the infancy of AI models with reasoning functionality, and while OpenAI's brand-new o3 and o3-mini are yet to take on this incredibly difficult benchmark, it's going to take a very long time for any LLM to come close to acing Humanity's Last Exam. It's worth bearing in mind, however, that AI is evolving at a rapid rate, with new functionality being made available to users almost daily. Just this week OpenAI unveiled Operator, its first AI agent, and it shows huge promise in a future where AI can automate tasks that would otherwise require human input. For now, no AI can come close to passing Humanity's Last Exam, but when one does... well, we could be in trouble.
[4]
Humanity's Last Exam is the New AI Benchmark
The AI field welcomes a new benchmark: Humanity's Last Exam (HLE), introduced by the Center for AI Safety (CAIS) and Scale AI to test AI systems on expert-level knowledge. The dataset includes 3,000 questions crowdsourced from 1,000 contributors across 500 institutions in 50 countries, including professors and PhD holders. It covers mathematics, the humanities, and the natural sciences using a multi-format approach that includes text, diagrams, and images. The benchmark tested models like GPT-4o, Claude 3.5 Sonnet, and DeepSeek, with none scoring above 10%, revealing their struggle with complex and interdisciplinary problems. The benchmark also showed that DeepSeek-R1 -- a cheaper and less powerful open-source model -- outperformed the full o1 model, which is known for its reasoning abilities.

HLE was created to address "benchmark saturation," where AI models excel on standard tests but fail on novel challenges. "I wrote 5 questions in the new benchmark that even the top AI models score less than 10% on: Humanity's Last Exam," said Jeremy Nguyen on X. The project involved contributors from diverse academic and research backgrounds. Summer Yue, Scale AI's Director of Research, said the benchmark was designed to push AI models to their reasoning limits. "Starting to see new well-built hard benchmarks in AI since almost everything else has already been exceeded. We now have this (with humanities questions), ARC-AGI 2, and Frontier Math. We also need some benchmarks for new knowledge creation rather than testing known problems," wrote Wharton's Ethan Mollick on X.

Last week, there were concerns about OpenAI's involvement with FrontierMath. For context, in December, OpenAI announced its o3 models, reporting 25% accuracy on Epoch AI's FrontierMath benchmark, a significant improvement from the previous 2% achieved by other models. Epoch AI recently clarified that OpenAI commissioned it to create 300 math questions for the FrontierMath benchmark. OpenAI owns these questions and has access to their statements and solutions, except for a 50-question private holdout set. The statement also noted that Epoch AI can evaluate and publish results on any model using the FrontierMath problem set but cannot share the questions or answers without OpenAI's written permission. "We can evaluate other models and have done so already. We will publish more results in the next few weeks, perhaps including DeepSeek's," Epoch's Tamay Besiroglu clarified to AIM, addressing FrontierMath's approach to evaluating models from other companies. Regarding the holdout set, Epoch AI explained it is finalising a 50-question set for which OpenAI will only receive the problem statements, not the solutions.

AI evaluations largely remain underfunded, and tougher benchmarks are essential as the field progresses towards AGI. "Going forward, we will ensure all contributors have access to information about industry funding and data access agreements before participating and proactively publicly disclose benchmark sponsorship and data access agreements," read Epoch's statement.
[5]
Even some of the best AI can't beat this new benchmark
The nonprofit Center for AI Safety (CAIS) and Scale AI, a company that provides a number of data labeling and AI development services, have released a challenging new benchmark for frontier AI systems. The benchmark, called Humanity's Last Exam, includes thousands of crowdsourced questions touching on subjects like mathematics, the humanities, and the natural sciences. To make the evaluation tougher, the questions come in multiple formats, including formats that incorporate diagrams and images. In a preliminary study, not a single publicly available flagship AI system managed to score better than 10% on Humanity's Last Exam. CAIS and Scale AI say they plan to open up the benchmark to the research community so that researchers can "dig deeper into the variations" and evaluate new AI models.
[6]
When AI passes this test, look out
AI systems are surpassing traditional tests, prompting the creation of "Humanity's Last Exam", a collection of extremely difficult questions across various fields. This new benchmark aims to measure AI's ability to tackle complex problems, though initial results show AI models still struggle. The exam highlights the challenge of evaluating AI's rapid progress and the limitations of standardized tests in capturing true human intelligence.

If you're looking for a new reason to be nervous about artificial intelligence, try this: Some of the smartest humans in the world are struggling to create tests that AI systems can't pass. For years, AI systems were measured by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging, SAT-caliber problems in areas like math, science and logic. Comparing the models' scores over time served as a rough measure of AI progress. But AI systems eventually got too good at those tests, so new, harder tests were created -- often with the types of questions graduate students might encounter on their exams. Those tests aren't in good shape, either. New models from companies like OpenAI, Google and Anthropic have been getting high scores on many doctorate-level challenges, limiting those tests' usefulness and leading to a chilling question: Are AI systems getting too smart for us to measure?

This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: A new evaluation, called "Humanity's Last Exam," that they claim is the hardest test ever administered to AI systems. Humanity's Last Exam is the brainchild of Dan Hendrycks, a well-known AI safety researcher and director of the Center for AI Safety. (The test's original name, "Humanity's Last Stand," was discarded for being overly dramatic.) Hendrycks worked with Scale AI, an AI company where he is an adviser, to compile the test, which consists of roughly 3,000 multiple-choice and short answer questions designed to test AI systems' abilities in areas including analytic philosophy and rocket engineering. Questions were submitted by experts in these fields, including college professors and prizewinning mathematicians, who were asked to come up with extremely difficult questions they knew the answers to.

Here, try your hand at a question about hummingbird anatomy from the test: Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

Or, if physics is more your speed, try this one: A block is placed on a horizontal rail, along which it can slide frictionlessly. It is attached to the end of a rigid, massless rod of length R. A mass is attached at the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass is given an infinitesimal push, parallel to the rail. Assume the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both these quantities could be negative, which would indicate that the rod is in compression.) What is the value of (T1-T2)/W?
(I would print the answers here, but that would spoil the test for any AI systems being trained on this column. Also, I'm far too dumb to verify the answers myself.)

The questions on Humanity's Last Exam went through a two-step filtering process. First, submitted questions were given to leading AI models to solve. If the models couldn't answer them (or if, in the case of multiple-choice questions, the models did worse than by random guessing), the questions were given to a set of human reviewers, who refined them and verified the correct answers. Experts who wrote top-rated questions were paid between $500 and $5,000 per question, as well as receiving credit for contributing to the exam. Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted a handful of questions to the test. Three of his questions were chosen, all of which he told me were "along the upper range of what one might see in a graduate exam."

Hendrycks, who helped create a widely used AI test known as Massive Multitask Language Understanding, or MMLU, said he was inspired to create harder AI tests by a conversation with Elon Musk. (Hendrycks is also a safety adviser to Musk's AI company, xAI.) Musk, he said, raised concerns about the existing tests given to AI models, which he thought were too easy. "Elon looked at the MMLU questions and said, 'These are undergrad level. I want things that a world-class expert could do,'" Hendrycks said.

There are other tests trying to measure advanced AI capabilities in certain domains, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by AI researcher François Chollet. But Humanity's Last Exam is aimed at determining how good AI systems are at answering complex questions across a wide variety of academic subjects, giving us what might be thought of as a general intelligence score. "We are trying to estimate the extent to which AI can automate a lot of really difficult intellectual labor," Hendrycks said.

Once the list of questions had been compiled, the researchers gave Humanity's Last Exam to six leading AI models, including Google's Gemini 1.5 Pro and Anthropic's Claude 3.5 Sonnet. All of them failed miserably. OpenAI's o1 system scored the highest of the bunch, with a score of 8.3%. Hendrycks said he expected those scores to rise quickly, and potentially to surpass 50% by the end of the year. At that point, he said, AI systems might be considered "world-class oracles," capable of answering questions on any topic more accurately than human experts. And we might have to look for other ways to measure AI's impacts, like looking at economic data or judging whether it can make novel discoveries in areas like math and science. "You can imagine a better version of this where we can give questions that we don't know the answers to yet, and we're able to verify if the model is able to help solve it for us," said Summer Yue, Scale AI's director of research and an organizer of the exam.

Part of what's so confusing about AI progress these days is how jagged it is. We have AI models capable of diagnosing diseases more effectively than human doctors, winning silver medals at the International Math Olympiad and beating top human programmers on competitive coding challenges. But these same models sometimes struggle with basic tasks, like arithmetic or writing metered poetry.
That has given them a reputation as astoundingly brilliant at some things and totally useless at others, and it has created vastly different impressions of how fast AI is improving, depending on whether you're looking at the best or the worst outputs. That jaggedness has also made measuring these models hard. I wrote last year that we need better evaluations for AI systems. I still believe that. But I also believe that we need more creative methods of tracking AI progress that don't rely on standardized tests, because most of what humans do -- and what we fear AI will do better than us -- can't be captured on a written exam. Zhou, the theoretical particle physics researcher who submitted questions to Humanity's Last Exam, told me that while AI models were often impressive at answering complex questions, he didn't consider them a threat to him and his colleagues, because their jobs involve much more than spitting out correct answers. "There's a big gulf between what it means to take an exam and what it means to be a practicing physicist and researcher," he said. "Even an AI that can answer these questions might not be ready to help in research, which is inherently less structured."
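The two-stage filtering described above, in which frontier models take a crack at each submission before human reviewers see it, can be sketched in a few lines of Python. The question format, the model_answers list, and the rule used for short-answer items are assumptions for illustration only, not the organizers' actual pipeline.

def beats_chance(question, model_answers):
    # Multiple-choice check: did the models, taken together, answer correctly more
    # often than random guessing over the listed choices would?
    chance = 1.0 / len(question["choices"])
    hit_rate = sum(answer == question["answer"] for answer in model_answers) / len(model_answers)
    return hit_rate > chance

def passes_first_stage(question, model_answers):
    # Keep a candidate question for human review only if frontier models fail it.
    if question["type"] == "multiple_choice":
        return not beats_chance(question, model_answers)
    # Short-answer items: keep only if every model's answer is wrong (assumed rule).
    return all(answer != question["answer"] for answer in model_answers)

A question that survives this automated screen then goes to the human reviewers described above, who refine the wording and verify the answer.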
[7]
A Test So Hard No AI System Can Pass It -- Yet
If you're looking for a new reason to be nervous about artificial intelligence, try this: Some of the smartest humans in the world are struggling to create tests that A.I. systems can't pass. For years, A.I. systems were measured by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging, S.A.T.-caliber problems in areas like math, science and logic. Comparing the models' scores over time served as a rough measure of A.I. progress. But A.I. systems eventually got too good at those tests, so new, harder tests were created -- often with the types of questions graduate students might encounter on their exams. Those tests aren't in good shape, either. New models from companies like OpenAI, Google and Anthropic have been getting high scores on many Ph.D.-level challenges, limiting those tests' usefulness and leading to a chilling question: Are A.I. systems getting too smart for us to measure?

This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: A new evaluation, called "Humanity's Last Exam," that they claim is the hardest test ever administered to A.I. systems. Humanity's Last Exam is the brainchild of Dan Hendrycks, a well-known A.I. safety researcher and director of the Center for AI Safety. (The test's original name, "Humanity's Last Stand," was discarded for being overly dramatic.) Mr. Hendrycks worked with Scale AI, an A.I. company where he is an advisor, to compile the test, which consists of roughly 3,000 multiple-choice and short answer questions designed to test A.I. systems' abilities in areas ranging from analytic philosophy to rocket engineering. Questions were submitted by experts in these fields, including college professors and prizewinning mathematicians, who were asked to come up with extremely difficult questions they knew the answers to.

Here, try your hand at a question about hummingbird anatomy from the test: Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

Or, if physics is more your speed, try this one: A block is placed on a horizontal rail, along which it can slide frictionlessly. It is attached to the end of a rigid, massless rod of length R. A mass is attached at the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass is given an infinitesimal push, parallel to the rail. Assume the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both these quantities could be negative, which would indicate that the rod is in compression.) What is the value of (T1-T2)/W?

(I would print the answers here, but that would spoil the test for any A.I. systems being trained on this column. Also, I'm far too dumb to verify the answers myself.)

The questions on Humanity's Last Exam went through a two-step filtering process. First, submitted questions were given to leading A.I. models to solve.
If the models couldn't answer them (or if, in the case of multiple-choice questions, the models did worse than by random guessing), the questions were given to a set of human reviewers, who refined them and verified the correct answers. Experts who wrote top-rated questions were paid between $500 and $5,000 per question, as well as receiving credit for contributing to the exam. Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted a handful of questions to the test. Three of his questions were chosen, all of which he told me were "along the upper range of what one might see in a graduate exam."

Mr. Hendrycks, who helped create a widely used A.I. test known as Massive Multitask Language Understanding, or M.M.L.U., said he was inspired to create harder A.I. tests by a conversation with Elon Musk. (Mr. Hendrycks is also a safety advisor to Mr. Musk's A.I. company, xAI.) Mr. Musk, he said, raised concerns about the existing tests given to A.I. models, which he thought were too easy. "Elon looked at the M.M.L.U. questions and said, 'These are undergrad level. I want things that a world-class expert could do,'" Mr. Hendrycks said.

There are other tests trying to measure advanced A.I. capabilities in certain domains, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by the A.I. researcher François Chollet. But Humanity's Last Exam is aimed at determining how good A.I. systems are at answering complex questions across a wide variety of academic subjects, giving us what might be thought of as a general intelligence score. "We are trying to estimate the extent to which A.I. can automate a lot of really difficult intellectual labor," Mr. Hendrycks said.

Once the list of questions had been compiled, the researchers gave Humanity's Last Exam to six leading A.I. models, including Google's Gemini 1.5 Pro and Anthropic's Claude 3.5 Sonnet. All of them failed miserably. OpenAI's o1 system scored the highest of the bunch, with a score of 8.3 percent. (The New York Times has sued OpenAI and its partner, Microsoft, accusing them of copyright infringement of news content related to A.I. systems. OpenAI and Microsoft have denied those claims.) Mr. Hendrycks said he expected those scores to rise quickly, and potentially to surpass 50 percent by the end of the year. At that point, he said, A.I. systems might be considered "world-class oracles," capable of answering questions on any topic more accurately than human experts. And we might have to look for other ways to measure A.I.'s impacts, like looking at economic data or judging whether it can make novel discoveries in areas like math and science. "You can imagine a better version of this where we can give questions that we don't know the answers to yet, and we're able to verify if the model is able to help solve it for us," said Summer Yue, Scale AI's director of research and an organizer of the exam.

Part of what's so confusing about A.I. progress these days is how jagged it is. We have A.I. models capable of diagnosing diseases more effectively than human doctors, winning silver medals at the International Math Olympiad and beating top human programmers on competitive coding challenges. But these same models sometimes struggle with basic tasks, like arithmetic or writing metered poetry.
That has given them a reputation as astoundingly brilliant at some things and totally useless at others, and it has created vastly different impressions of how fast A.I. is improving, depending on whether you're looking at the best or the worst outputs. That jaggedness has also made measuring these models hard. I wrote last year that we need better evaluations for A.I. systems. I still believe that. But I also believe that we need more creative methods of tracking A.I. progress that don't rely on standardized tests, because most of what humans do -- and what we fear A.I. will do better than us -- can't be captured on a written exam. Mr. Zhou, the theoretical particle physics researcher who submitted questions to Humanity's Last Exam, told me that while A.I. models were often impressive at answering complex questions, he didn't consider them a threat to him and his colleagues, because their jobs involve much more than spitting out correct answers. "There's a big gulf between what it means to take an exam and what it means to be a practicing physicist and researcher," he said. "Even an A.I. that can answer these questions might not be ready to help in research, which is inherently less structured."
Scale AI and the Center for AI Safety have introduced a challenging new AI benchmark called 'Humanity's Last Exam', which has proven difficult for even the most advanced AI models, highlighting the current limitations of artificial intelligence.
Scale AI and the Center for AI Safety (CAIS) have introduced a groundbreaking new AI benchmark called "Humanity's Last Exam" (HLE), designed to test the limits of AI knowledge at the frontiers of human expertise [1][2]. This benchmark aims to address the issue of "benchmark saturation," where AI models have been rapidly excelling on standard tests, making it difficult to accurately gauge their capabilities [3].
The HLE consists of 3,000 questions covering over 100 subjects in mathematics, science, and the humanities [1]. These questions were carefully selected from an initial pool of 70,000, with input from nearly 1,000 subject-expert contributors across 500 institutions in 50 countries [2][4]. The benchmark includes multiple-choice and short-answer questions, as well as multi-modal elements incorporating text, diagrams, and images [4].
In initial testing, current AI models struggled significantly with the HLE, with every model evaluated scoring below 10% [3]:
- GPT-4o: 3.3%
- Grok-2: 3.8%
- Claude 3.5 Sonnet: 4.3%
- Gemini: 6.2%
- o1: 9.1%
- DeepSeek-R1 (text only, as it is not multi-modal): 9.4%
These results stand in stark contrast to the high scores (often above 90% in the case of MMLU) that many of these models achieve on other popular benchmarks like MMLU, MATH, and GPQA [1][2].
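For readers who want to reproduce this kind of comparison once the public question set is in hand, a minimal scoring loop might look like the sketch below. The question dictionaries, the grading rules, and the model callables are assumed stand-ins, not the official HLE data format or evaluation harness.

def grade(question, prediction):
    # Assumed grading rules: multiple-choice answers compared as choice letters,
    # short answers by normalized exact string match.
    if question["type"] == "multiple_choice":
        return prediction.strip().upper() == question["answer"].strip().upper()
    return prediction.strip().lower() == question["answer"].strip().lower()

def score_models(models, questions):
    # models: mapping of model name -> callable taking a prompt string and returning
    # an answer string; questions: list of dicts with "type", "prompt", and "answer".
    results = {}
    for name, answer_fn in models.items():
        correct = sum(grade(q, answer_fn(q["prompt"])) for q in questions)
        results[name] = 100.0 * correct / len(questions)
    return results

# Example with a toy stand-in "model" that always answers "42":
# score_models({"toy": lambda prompt: "42"}, questions)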
The poor performance of top AI models on the HLE reveals that there are still significant gaps in AI capabilities when it comes to expert-level knowledge and complex reasoning [2]. Dan Hendrycks, co-founder and executive director of CAIS, noted that while it's uncertain how quickly models will advance, the HLE currently demonstrates that there are still expert-level questions that AI models cannot answer [1][2].
While the current results show a clear limitation in AI capabilities, researchers are cautious about making long-term predictions. Given the rapid pace of AI advancement, it's considered plausible that models could reach over 50% accuracy on the HLE by the end of the year [2]. However, the benchmark's creators emphasize that such an achievement would not necessarily indicate autonomous research capabilities or artificial general intelligence [2].
CAIS and Scale AI plan to release the HLE dataset to researchers for further study of AI systems and their limitations [1]. The benchmark remains open for additional test questions, though cash prizes are no longer being awarded [1]. This initiative represents an important step in creating more challenging and comprehensive evaluations of AI capabilities as the field continues to evolve rapidly.