Curated by THEOUTPOST
On Tue, 25 Mar, 4:03 PM UTC
5 Sources
[1]
A new, challenging AGI test stumps most AI models | TechCrunch
The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Tuesday that it has created a new, challenging test to measure the general intelligence of leading AI models. So far, the new test, called ARC-AGI-2, has stumped most models. "Reasoning" AI models like OpenAI's o1-pro and DeepSeek's R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash score around 1%. The ARC-AGI tests consist of puzzle-like problems where an AI has to identify visual patterns from a collection of different-colored squares, and generate the correct "answer" grid. The problems were designed to force an AI to adapt to new problems it hasn't seen before. The Arc Prize Foundation had over 400 people take ARC-AGI-2 to establish a human baseline. On average, "panels" of these people got 60% of the test's questions right -- much better than any of the models' scores. In a post on X, Chollet claimed ARC-AGI-2 is a better measure of an AI model's actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation's tests are aimed at evaluating whether an AI system can efficiently acquire new skills outside the data it was trained on. Chollet said that unlike ARC-AGI-1, the new test prevents AI models from relying on "brute force" -- extensive computing power -- to find solutions. Chollet previously acknowledged this was a major flaw of ARC-AGI-1. To address the first test's flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization. "Intelligence is not solely defined by the ability to solve problems or achieve high scores," Arc Prize Foundation co-founder Greg Kamradt wrote in a blog post. "The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. 
The core question being asked is not just, 'Can AI acquire [the] skill to solve a task?' but also, 'At what efficiency or cost?'" ARC-AGI-1 was unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3's performance gains on ARC-AGI-1 came with a hefty price tag. The version of OpenAI's o3 model -- o3 (low) -- that was first to reach new heights on ARC-AGI-1, scoring 75.7% on the test, got a measly 4% on ARC-AGI-2 using $200 worth of computing power per task. The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face's co-founder, Thomas Wolf, recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity. Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to reach 85% accuracy on the ARC-AGI-2 test while only spending $0.42 per task.
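The grid-based task format described above can be made concrete with a minimal sketch. The toy task and the hidden rule below (a simple transpose) are invented for illustration; real ARC-AGI-2 tasks use far harder, composite transformations that cannot be cracked by checking a single fixed rule.

```python
# Toy illustration of the ARC-style task format: a few train input->output
# grid pairs, plus a test input whose output grid the solver must produce.
# Grids are 2-D lists of color indices. This task is hypothetical.

def transpose(grid):
    """The hidden rule of this made-up task: swap rows and columns."""
    return [list(row) for row in zip(*grid)]

task = {
    "train": [
        {"input": [[1, 2, 3], [4, 5, 6]], "output": [[1, 4], [2, 5], [3, 6]]},
        {"input": [[0, 7], [7, 0]], "output": [[0, 7], [7, 0]]},
    ],
    "test": {"input": [[9, 0, 0], [0, 9, 0]]},
}

def candidate_fits(rule, examples):
    """A candidate rule is accepted only if it reproduces every train pair."""
    return all(rule(ex["input"]) == ex["output"] for ex in examples)

answer = None
if candidate_fits(transpose, task["train"]):
    answer = transpose(task["test"]["input"])  # -> [[9, 0], [0, 9], [0, 0]]
```

The point of the format is visible even in this sketch: the train pairs alone must pin down the rule, so memorizing answers from training data is of no use on a fresh task.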
[2]
Leading AI models fail new test of artificial general intelligence
The most sophisticated AI models in existence today have scored poorly on a new benchmark designed to measure their progress towards artificial general intelligence (AGI) - and brute-force computing power won't be enough to improve their scores, as evaluators now take into account the cost of running the model. There are many competing definitions of AGI, but it is generally taken to refer to an AI that can perform any cognitive task that humans can do. To measure this, the ARC Prize Foundation previously launched a test of reasoning abilities called ARC-AGI-1. Last December, OpenAI announced that its o3 model had scored highly on the test, leading some to ask if the company was close to achieving AGI. But now a new test, ARC-AGI-2, has raised the bar. It is difficult enough that no current AI system on the market can achieve more than a single-digit score out of 100, while every question has been solved by at least two humans in fewer than two attempts. In a blog post announcing ARC-AGI-2, ARC president Greg Kamradt said the new benchmark was required to test different skills from the previous iteration. "To beat it, you must demonstrate both a high level of adaptability and high efficiency," he wrote. The ARC-AGI-2 benchmark differs from other AI benchmark tests in that it focuses on AI models' abilities to complete seemingly simple tasks - such as replicating changes in a new image based on past examples of symbolic interpretation - rather than their ability to match world-leading PhD performances. Current models are good at the "deep learning" that ARC-AGI-1 measured, but are not as good at the apparently simpler tasks in ARC-AGI-2, which demand more flexible reasoning and interaction. OpenAI's o3-low model, for instance, scores 75.7 per cent on ARC-AGI-1, but just 4 per cent on ARC-AGI-2.
The benchmark also adds a new dimension to measuring an AI's capabilities, by looking at its efficiency in problem-solving, as measured by the cost required to complete a task. For example, while ARC paid its human testers $17 per task, it estimates that o3-low costs OpenAI $200 in fees for the same work. "I think the new iteration of ARC-AGI now focusing on balancing performance with efficiency is a big step towards a more realistic evaluation of AI models," says Joseph Imperial at the University of Bath, UK. "This is a sign that we're moving from one-dimensional evaluation tests solely focusing on performance but also considering less compute power." Any model that is able to pass ARC-AGI-2 would need to not just be highly competent, but also smaller and lightweight, says Imperial - with the efficiency of the model being a key component of the new benchmark. This could help address concerns that AI models are becoming more energy-intensive - sometimes to the point of wastefulness - to achieve ever-greater results. However, not everyone is convinced that the new measure is beneficial. "The whole framing of this as it testing intelligence is not the right framing," says Catherine Flick at the University of Staffordshire, UK. Instead, she says these benchmarks merely assess an AI's ability to complete a single task or set of tasks well, which is then extrapolated to mean general capabilities across a series of tasks. Performing well on these benchmarks should not be seen as a major moment towards AGI, says Flick: "You see the media pick up that these models are passing these human-level intelligence tests, where actually they're not; what they are doing is really just responding to a particular prompt accurately." And exactly what happens if or when ARC-AGI-2 is passed is another question - will we need yet another benchmark? 
"If they were to develop ARC-AGI-3, I'm guessing they would add another axis in the graph denoting [the] minimum number of humans - whether expert or not - it would take to solve the tasks, in addition to performance and efficiency," says Imperial. In other words, the debate over AGI is unlikely to be settled soon.
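The efficiency dimension discussed above can be sketched in a few lines. The numbers come from the articles (o3-low at roughly $200 per task versus the human testers' $17, and the Arc Prize 2025 target of 85% accuracy at $0.42 per task); the function itself is a hypothetical illustration, not ARC's actual scoring code.

```python
# Sketch of the two-dimensional evaluation: a submission is judged on
# accuracy AND cost per task. Thresholds below are the Arc Prize 2025
# contest targets reported in the articles; the helper is hypothetical.

def meets_prize_target(accuracy_pct, cost_per_task_usd,
                       target_pct=85.0, cost_cap_usd=0.42):
    """A submission must clear the accuracy bar AND stay under the cost cap."""
    return accuracy_pct >= target_pct and cost_per_task_usd <= cost_cap_usd

o3_low = meets_prize_target(4.0, 200.0)       # low score, high cost -> False
human_panel = meets_prize_target(60.0, 17.0)  # average human panel -> False
```

Under this framing, raw capability alone cannot win: a model that hit 100% at $200 per task would still fail the cost cap, which is exactly the "brute force" route the new benchmark is meant to close off.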
[3]
ChatGPT, Gemini and Claude all failed to solve a simple test that humans are acing
As artificial intelligence continues to build on its reputation as the smartest thing in the room, it will be oddly therapeutic to hear that one test has it stumped. In fact, this new AI examination system is causing issues for even the most advanced models. ARC-AGI-2, or to use its more glamorous name, "the Abstraction and Reasoning Corpus", is a new test developed to measure an AI model's reasoning and general problem-solving. It was created by a non-profit called the Arc Prize Foundation, which exists to accelerate the development of Artificial General Intelligence (AGI) -- something OpenAI CEO Sam Altman has claimed could arrive as soon as this year. DeepSeek's R1 model scored just 1.3% on the new test, and similar models like Google's Gemini 2.0 Flash and Anthropic's Claude 3.7 Sonnet scored around 1%. OpenAI's GPT-4.5 model likewise scored 0.8%. So what are they being tested on that is so hard? The test itself includes puzzle-like problems where the AI model has to identify visual patterns from a collection of colored squares. Once the pattern is identified, the model then has to select the correct answer. It's a bit like grade-school math problems: you cannot simply memorize your way to the answer. Instead, the tasks require a model to apply existing knowledge and models of understanding to completely new problems. By doing this, the test doesn't just treat intelligence as the ability to solve problems or get the highest score. Instead, it's looking at how efficiently AI can adapt, learn, and solve new problems on the fly. This kind of test is designed to force the AI to solve problems it has never seen before, acquiring new skills outside the data it was trained on. Unlike some previous tests, the aim here is to provide something that is easy for humans to complete but hard for AI. Over 400 people were asked to take the same test, and this human "panel" scored an average of 60% -- far exceeding even the best-performing AI models.
This is where the team behind the test believes we should be testing AI. While the likes of ChatGPT, Gemini, and Claude can all outperform humans in a variety of tasks, there are still plenty of areas where humans are better. As the name suggests, this isn't the first version of this test. In 2019, François Chollet, then a Google researcher, created ARC-AGI-1. It took AI models roughly five years to beat, a marker of the eventual advancement in reasoning for these systems. While it could well take the models a few more years to solve this newer test, the team behind it believes it is an important measure to aim for. Once there are no tasks left that are easy for humans but hard for AI, they believe we will have achieved artificial general intelligence -- AI that can match human capabilities across the board.
[4]
A new AI test is outwitting OpenAI, Google models, among others
Humans are still way smarter than AI according to this new AGI benchmark. Google, OpenAI, DeepSeek, et al. are nowhere near achieving AGI (Artificial General Intelligence), according to a new benchmark. The Arc Prize Foundation, a nonprofit that measures AGI progress, has a new benchmark that is stumping the leading AI models. The test, called ARC-AGI-2, is the second edition of the ARC-AGI benchmark, which tests models on general intelligence by challenging them to solve visual puzzles using pattern recognition, context clues, and reasoning. According to the ARC-AGI leaderboard, OpenAI's most advanced model, o3-low, scored 4 percent. Google's Gemini 2.0 Flash and DeepSeek R1 both scored 1.3 percent. Anthropic's most advanced model, Claude 3.7 with an 8K token limit (the number of tokens the model may use to process an answer), scored 0.9 percent. The question of how and when AGI will be achieved remains as heated as ever, with various factions bickering about the timeline or whether it's even possible. Anthropic CEO Dario Amodei said it could take as little as two to three years, and OpenAI CEO Sam Altman said "it's achievable with current hardware." But experts like Gary Marcus and Yann LeCun say the technology isn't there yet, and it doesn't take an expert to see how fueling AGI hype is advantageous to AI companies seeking major investments. The ARC-AGI benchmark is designed to challenge AI models beyond specialized intelligence by avoiding the memorization trap -- spewing out PhD-level responses without understanding what they mean. Instead, it focuses on puzzles that are relatively easy for humans to solve because of our innate ability to take in new information and make inferences, thus revealing gaps that can't be resolved by simply feeding AI models more data.
"Intelligence requires the ability to generalize from limited experience and apply knowledge in new, unexpected situations. AI systems are already superhuman in many specific domains (e.g., playing Go and image recognition)," the announcement read. "However, these are narrow, specialized capabilities. The 'human-ai gap' reveals what's missing for general intelligence - highly efficiently acquiring new skills." To get a sense of AI models' current limitations, you can take the ARC-AGI test for yourself. And you might be surprised by its simplicity. There's some critical thinking involved, but the ARC-AGI test wouldn't be out of place next to the New York Times crossword puzzle, Wordle, or any of the other popular brain teasers. It's challenging but not impossible, and the answer is there in the puzzle's logic, which is something the human brain has evolved to interpret. OpenAI's o3-low model scored 75.7 percent on the first edition of ARC-AGI. By comparison, its 4 percent score on the second edition shows how difficult the test is, but also how much work remains to reach human-level intelligence.
[5]
LLMs Hit a New Low on ARC-AGI-2 Benchmark, Pure LLMs Score 0%
ARC Prize, a non-profit organisation that evaluates the effectiveness of AI models at demonstrating human-like intelligence, has announced the ARC-AGI-2 benchmark. The new benchmark is a successor to the ARC-AGI benchmark released a few years ago. Like its predecessor, the benchmark tests AI models on tasks that are relatively easy for humans but difficult for artificial systems. The ARC-AGI-2 benchmark poses even greater challenges than its predecessor, as it factors in efficiency (cost-per-task) in addition to performance. The tasks require AI models to interpret symbols beyond their visual patterns, simultaneously apply interrelated rules, and use different rules depending on context. The results revealed that AI models found all of the above tasks challenging. Non-reasoning models, or 'Pure LLMs', scored 0% on the benchmark, while other publicly available reasoning models received single-digit percentage scores of less than 4%. In contrast, a human panel solving the tasks achieved a perfect score of 100%. "AI systems are already superhuman in many specific domains (e.g., playing Go and image recognition.) However, these are narrow, specialised capabilities. The 'human-ai gap' reveals what's missing for general intelligence -- highly efficiently acquiring new skills," the organisation said. OpenAI's unreleased o3 reasoning model achieved the highest score of 4.0%. In the previous ARC-AGI-1 benchmark, it scored 75.7%. However, Sam Altman, CEO of OpenAI, has disclosed that it will not be released as a standalone model; instead, o3's reasoning capabilities will be integrated into a hybrid GPT-5 model. Beyond that, no other AI models posted noteworthy scores. Even the recently released Claude 3.7 Sonnet model, often considered the best model for coding, scored 0.7%, whereas the DeepSeek-R1 model scored 1.3%. The leaderboard also outlined the cost (in USD) of performing each task.
"All other AI benchmarks focus on superhuman capabilities or specialised knowledge by testing 'PhD++' skills. ARC-AGI is the only benchmark that takes the opposite design choice by focusing on tasks that are relatively easy for humans, yet hard, or impossible, for AI," the organisation added. François Chollet, creator of Keras and a former Google researcher, is one of the creators of the ARC-AGI benchmark. He said it is "the only AI benchmark that measures progress towards general intelligence". Recently, Chollet, along with Zapier co-founder Mike Knoop, launched Ndea, a new research lab dedicated to creating artificial general intelligence (AGI).
The Arc Prize Foundation introduces ARC-AGI-2, a challenging new test for artificial general intelligence that current AI models, including those from OpenAI and Google, are struggling to solve. The benchmark emphasizes efficiency and adaptability, revealing limitations in current AI capabilities.
The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, has unveiled a new benchmark test called ARC-AGI-2, designed to measure the general intelligence of leading AI models [1]. This test has proven to be significantly more challenging than its predecessor, with most current AI models struggling to achieve even single-digit scores.
The results of the ARC-AGI-2 test have been eye-opening:
- OpenAI's o3 (low), the top performer, scored just 4% [4][5]
- Reasoning models such as o1-pro and DeepSeek's R1 scored between 1% and 1.3% [1]
- Non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, scored around 1% or below [1][5]
In stark contrast, a human panel achieved an average score of 60% on the test, and every task was solved by at least two people [1][5].
The new benchmark introduces several important changes:
- An efficiency metric that factors in the cost of solving each task, not just accuracy [1][5]
- Tasks that must be interpreted on the fly, preventing models from relying on memorization [1]
- Safeguards against "brute force" solutions that lean on extensive computing power [1]
The poor performance of leading AI models on ARC-AGI-2 highlights the significant gap between current AI capabilities and human-level general intelligence. Greg Kamradt, co-founder of the Arc Prize Foundation, emphasized that intelligence is not solely about problem-solving ability but also about the efficiency of acquiring and deploying new skills [1].
This benchmark challenges the notion that brute-force computing power alone can lead to AGI. It suggests that fundamental advancements in AI architecture and learning approaches may be necessary to achieve human-like adaptability and efficiency [2][4].
While many in the tech industry welcome new benchmarks to measure AI progress, some experts question the framing of these tests. Catherine Flick from the University of Staffordshire argues that performing well on such benchmarks should not be seen as a major step towards AGI, as they only assess an AI's ability to complete specific tasks rather than demonstrate true general intelligence [2].
The introduction of ARC-AGI-2 raises questions about the future of AGI evaluation. Joseph Imperial from the University of Bath suggests that future iterations might incorporate additional metrics, such as the minimum number of humans required to solve tasks, alongside performance and efficiency measures [2].
As the debate over AGI continues, the Arc Prize Foundation has announced a new contest challenging developers to reach 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task [1]. This competition aims to drive innovation in both AI performance and efficiency, potentially bringing us closer to the elusive goal of artificial general intelligence.
© 2025 TheOutpost.AI All rights reserved