5 Sources
[1]
AI benchmarks hampered by bad science
AI companies regularly tout their models' performance on benchmark tests as a sign of technological and intellectual superiority. But those results, widely used in marketing, may not be meaningful. A study from researchers at the Oxford Internet Institute (OII) and several other universities and organizations has found that only 16 percent of 445 LLM benchmarks for natural language processing and machine learning use rigorous scientific methods to compare model performance. What's more, about half the benchmarks claim to measure abstract ideas like reasoning or harmlessness without offering a clear definition of those terms or how to measure them. In a statement, Andrew Bean, lead author of the study, said: "Benchmarks underpin nearly all claims about advances in AI. But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to."

When OpenAI released GPT-5 earlier this year, the company's pitch rested on a foundation of benchmark scores, such as those from AIME 2025, SWE-bench Verified, Aider Polyglot, MMMU, and HealthBench Hard. These tests present AI models with a series of questions, and model makers strive to have their bots answer as many as possible. The questions or challenges vary depending upon the focus of the test; a math-oriented benchmark like AIME 2025 asks models to solve competition-level math problems. "[GPT-5] sets a new state of the art across math (94.6 percent on AIME 2025 without tools), real-world coding (74.9 percent on SWE-bench Verified, 88 percent on Aider Polyglot), multimodal understanding (84.2 percent on MMMU), and health (46.2 percent on HealthBench Hard) -- and those gains show up in everyday use," OpenAI said at the time. "With GPT‑5 pro's extended reasoning, the model also sets a new SOTA on GPQA, scoring 88.4 percent without tools."
But, as noted in the OII study, "Measuring What Matters: Construct Validity in Large Language Model Benchmarks," 27 percent of the reviewed benchmarks rely on convenience sampling, meaning that the sample data is chosen for the sake of convenience rather than using methods like random sampling or stratified sampling. "For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."

The OII study authors have created a checklist with eight recommendations to make benchmarks better. These include defining the phenomenon being measured, preparing for contamination, and using statistical methods to compare models. Alongside the OII, the other study authors are affiliated with EPFL, Stanford University, the Technical University of Munich, UC Berkeley, the UK AI Security Institute, the Weizenbaum Institute, and Yale University.

Bean et al. are far from the first to question the validity of AI benchmark tests. In February, for example, researchers from the European Commission's Joint Research Center published a paper titled, "Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation." As we noted at the time, the authors of that research identified "a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results." At least some of those who design benchmark tests are aware of these concerns.
On the same day that the OII study was announced, Greg Kamradt, president of the Arc Prize Foundation, a non-profit that administers an award program based on the Abstract and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) benchmark, announced, "ARC Prize Verified, a program to increase the rigor of evaluating frontier systems on the ARC-AGI benchmark." Verification and testing rigor are necessary, Kamradt observed, because scores reported by model makers or third parties may arise from different datasets and prompting methods that make comparison difficult. "This causes confusion in the market and ultimately detracts from our goal of measuring frontier AI progress," Kamradt explained.

OpenAI and Microsoft reportedly have their own internal benchmark for determining when AGI - vaguely defined by OpenAI as "AI systems that are generally smarter than humans" - has been achieved. That milestone matters to the two companies because it releases OpenAI from its IP rights and Azure API exclusivity agreement with Microsoft. This AGI benchmark, according to The Information, can be met by OpenAI developing AI systems that generate at least $100 billion in profits. Measuring money turns out to be easier than measuring intelligence.
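The convenience-sampling critique in the study is easy to picture in code. The sketch below is purely illustrative (the question pool and difficulty strata are invented, not drawn from AIME or the paper itself); it contrasts grabbing whatever items are at hand with stratified random sampling, which forces every difficulty band to be represented:

```python
import random

# Hypothetical pool of benchmark questions, each tagged with a difficulty
# stratum (here, the digit count of the numbers involved). Purely made up.
pool = (
    [{"q": f"small-{i}", "stratum": "1-2 digits"} for i in range(100)]
    + [{"q": f"medium-{i}", "stratum": "3-4 digits"} for i in range(100)]
    + [{"q": f"large-{i}", "stratum": "5+ digits"} for i in range(100)]
)

def convenience_sample(pool, n):
    """Take the first n items: whatever is easiest to grab."""
    return pool[:n]

def stratified_sample(pool, n_per_stratum, seed=0):
    """Draw the same number of items at random from every stratum."""
    rng = random.Random(seed)
    strata = {}
    for item in pool:
        strata.setdefault(item["stratum"], []).append(item)
    sample = []
    for items in strata.values():
        sample.extend(rng.sample(items, n_per_stratum))
    return sample

easy = convenience_sample(pool, 30)     # drawn entirely from the first stratum
balanced = stratified_sample(pool, 10)  # 10 items from each of the 3 strata

print({item["stratum"] for item in easy})      # only one stratum represented
print({item["stratum"] for item in balanced})  # all three strata represented
```

A benchmark built only from the convenience sample would never probe the larger-number strata where, as the study notes, LLMs tend to struggle.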
[2]
AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds
You know all of those reports about artificial intelligence models successfully passing the bar or achieving Ph.D.-level intelligence? Looks like we should start taking those degrees back. A new study from researchers at the Oxford Internet Institute suggests that most of the popular benchmarking tools that are used to test AI performance are often unreliable and misleading. Researchers looked at 445 different benchmark tests used by the industry and other academic outfits to test everything from reasoning capabilities to performance on coding tasks. Experts reviewed each benchmarking approach and found indications that the results produced by these tests may not be as accurate as they have been presented, due in part to vague definitions for what a benchmark is attempting to test and a lack of disclosure of statistical methods that would allow different models to be easily compared. A big problem that the researchers found is that "Many benchmarks are not valid measurements of their intended targets." That is to say, while a benchmark may claim to measure a specific skill, it could identify that skill in a way that doesn't actually capture a model's capability. For example, the researchers point to the Grade School Math 8K (GSM8K) benchmarking test, which measures a model's performance on grade school-level word-based math problems designed to push the model into "multi-step mathematical reasoning." The GSM8K is advertised as being "useful for probing the informal reasoning ability of large language models." But the researchers argue that the test doesn't necessarily tell you if a model is engaging in reasoning. "When you ask a first grader what two plus five equals and they say seven, yes, that's the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? 
Perhaps, but I think the answer is very likely no," Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author of the study, told NBC News. In the study, the researchers pointed out that GSM8K scores have increased over time, which may point to models getting better at this kind of reasoning and performance. But it may also point to contamination, which happens when benchmark test questions make it into the model's dataset or the model starts "memorizing" answers or information rather than reasoning its way to a solution. When researchers tested the same performance on a new set of benchmark questions, they noticed that models experienced "significant performance drops." While this study is among the largest reviews of AI benchmarking, it's not the first to suggest this system of measurement may not be all that it's sold to be. Last year, researchers at Stanford analyzed several popular AI model benchmark tests and found "large quality differences between them, including those widely relied on by developers and policymakers," and noted that most benchmarks "are highest quality at the design stage and lowest quality at the implementation stage." If nothing else, the research is a good reminder that these performance measures, while often well-intended and meant to provide an accurate analysis of a model, can be turned into little more than marketing speak for companies.
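The contamination problem described here, benchmark questions leaking into training data, is often screened for by checking n-gram overlap between test items and a corpus. The toy sketch below uses made-up strings and a simplistic 8-word-gram check; real contamination audits are considerably more sophisticated:

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in a text (lowercased, whitespace split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_rate(benchmark_item, training_corpus, n=8):
    """Fraction of the item's n-grams that also appear in the corpus.
    A high rate suggests the item may have leaked into training data."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = ngrams(training_corpus, n)
    return len(item_grams & corpus_grams) / len(item_grams)

# Invented example question and corpora for illustration only.
question = ("A farmer sold clips to 48 of her friends in April and then "
            "she sold half as many clips in May")
corpus_with_leak = "blog post quoting the test: " + question
clean_corpus = "an unrelated passage about weather patterns in the Alps"

print(overlap_rate(question, corpus_with_leak))  # full overlap: likely leaked
print(overlap_rate(question, clean_corpus))      # no overlap: looks clean
```

An item flagged this way may be answered from memorization rather than reasoning, which is one explanation the researchers offer for scores rising on old questions while dropping on fresh ones.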
[3]
AI safety tests are heavily flawed, new study finds -- here's why that could be a huge problem
A new study into the testing procedures behind common AI models has reached some worrying conclusions. The joint investigation between U.S. and U.K. researchers examined data from over 440 benchmarking tests used to measure an AI's ability to resolve problems and determine safety parameters. They reported flaws in these tests that undermine the credibility of claims made about these models. According to the study, the flaws stem from these benchmarks being built on unclear definitions or weak analytical methods, making it difficult to accurately assess a model's abilities or AI progress. "Benchmarks underpin nearly all claims about advances in AI," said Andrew Bean, lead author of the study. "But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to." Currently, there is no clear regulation on AI models. Instead, they are tested on a wide range of benchmark examinations, such as their ability to solve common logic problems or tests on whether they can be blackmailed. These tests allow AI companies to see where their models fall down and make improvements based on these results in the next iteration. They are also typically the measurement used in policy or regulation decisions. The safety of AI models has been up for debate for a while now. In the past, companies like OpenAI and Google have launched their models without completing safety reports. Elsewhere, models have been launched after scoring highly in a range of benchmarking tests, only to fail when released to the public. Google recently withdrew one of its models, Gemma, after it made false allegations about a U.S. senator, and similar issues have occurred in the past, such as xAI's Grok hallucinating conspiracy theories. The study was carried out by researchers from the University of California, Berkeley and the University of Oxford in the U.K. 
The team made eight recommendations to AI companies to address the issues they raised, and provided a checklist that any benchmark designer can use to test whether their own evaluations are up to scratch. Whether or not the AI companies take these recommendations on board remains to be seen.
[4]
AI's capabilities may be exaggerated by flawed tests, according to new study
Researchers said that the methods used to evaluate AI are oftentimes lacking in rigor. Researchers behind a new study say that the methods used to evaluate AI systems' capabilities routinely oversell AI performance and lack scientific rigor. The study, led by researchers at the Oxford Internet Institute in partnership with over three dozen researchers from other institutions, examined 445 leading AI tests, called benchmarks, often used to measure the performance of AI models across a variety of topic areas. AI developers and researchers use these benchmarks to evaluate model abilities and tout technical progress, referencing them to make claims on topics ranging from software engineering performance to abstract-reasoning capacity. However, the paper, released Tuesday, claims these fundamental tests might not be reliable and calls into question the validity of many benchmark results. According to the study, a significant number of top-tier benchmarks fail to define what exactly they aim to test, concerningly reuse data and testing methods from pre-existing benchmarks, and seldom use reliable statistical methods to compare results between models. Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author of the study, argued these benchmarks can be alarmingly misleading: "When we ask AI models to perform certain tasks, we often actually measure completely different concepts or constructs than what we aim to measure," Mahdi told NBC News. Andrew Bean, a researcher at the Oxford Internet Institute and another lead author of the study, concurred that even reputable benchmarks are too often blindly trusted and deserve more scrutiny. "You need to really take it with a grain of salt when you hear things like 'a model achieves Ph.D. level intelligence,'" Bean told NBC News. "We're not sure that those measurements are being done especially well." 
Some of the benchmarks examined in the analysis measure specific skills, like Russian or Arabic language abilities, while other benchmarks measure more general capabilities, like spatial reasoning and continual learning. A core issue for the authors was whether a benchmark is a good test of the real-world phenomenon it aims to measure, or what the authors label as "construct validity." Instead of testing a model on an endless series of questions to evaluate its ability to speak Russian, for example, one benchmark reviewed in the study measures a model's performance on nine different tasks, like answering yes-or-no questions using information drawn from Russian-language Wikipedia. However, roughly half of the benchmarks examined in the study fail to clearly define the concepts they purport to measure, casting doubt on benchmarks' ability to yield useful information about the AI models being tested. As an example, in the study the authors showcase a common AI benchmark called Grade School Math 8K (GSM8K), which measures performance on a set of basic math questions. Observers often point to leaderboards on the GSM8K benchmark to show that AI models are highly capable at fundamental mathematical reasoning, and the benchmark's documentation says it is "useful for probing the informal reasoning ability of large language models." Yet correct answers on benchmarks like GSM8K do not necessarily mean the model is actually engaging in mathematical reasoning, study author Mahdi said. "When you ask a first grader what two plus five equals and they say seven, yes, that's the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no." Bean acknowledged that measuring nebulous concepts like reasoning requires evaluating a subset of tasks, and that such selection will invariably be imperfect. 
"There are a lot of moving pieces in these evaluations, and satisfying all of them requires balance. But this paper calls for benchmarks to clearly define what they set out to measure," he said. "With concepts like harmlessness or reasoning, people oftentimes just throw the word around to pick something that falls near that category that they can measure and say, 'Great, now I've measured it,'" Bean added. In the new paper, the authors make eight recommendations and provide a checklist to systematize benchmark criteria and improve the transparency and trust in benchmarks. The suggested improvements include specifying the scope of the particular action being evaluated, constructing batteries of tasks that better represent the overall abilities being measured, and comparing models' performance via statistical analysis. Nikola Jurkovic, a member of technical staff at the influential METR AI research center, commended the paper's contributions. "We need more rigor if we want to be able to interpret the results of AI benchmarks. This checklist is a starting point for researchers to check whether their benchmark will be insightful," Jurkovic told NBC News. Tuesday's paper builds on previous research pointing out flaws in many AI benchmarks. Last year, researchers from AI company Anthropic advocated for increased statistical testing to determine whether a model's performance on a specific benchmark really showed a difference in capabilities or was rather just a lucky result given the tasks and questions included in the benchmark. To attempt to increase the usefulness and accuracy of benchmarks, several research groups have recently proposed new series of tests that better measure models' real-world performance on economically meaningful tasks. In late September, OpenAI released a new series of tests that evaluate AI's performance on tasks required for 44 different occupations, in an attempt to better ground claims of AI capabilities in the real world. 
For example, the tests measure AI's ability to fix inconsistencies in customer invoice Excel spreadsheets for an imaginary sales analyst role, or AI's ability to create a full production schedule for a 60-second video shoot for an imaginary video producer. Dan Hendrycks, director of the Center for AI Safety, and a team of researchers recently released a similar real-world benchmark designed to evaluate AI systems' performance on a range of tasks necessary for the automation of remote work. "It's common for AI systems to score high on a benchmark but not actually solve the benchmark's actual goal," Hendrycks told NBC News. Surveying the broader landscape of AI benchmarks, Mahdi said researchers and developers have many exciting avenues to explore. "We are just at the very beginning of the scientific evaluation of AI systems," Mahdi said.
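The statistical rigor that Anthropic and the OII authors call for can be approximated with resampling methods. The sketch below, with invented per-question scores, uses a paired bootstrap to ask how often one model's lead over another survives resampling the benchmark's questions. It illustrates the general idea, not the exact procedure from either paper:

```python
import random

def paired_bootstrap(results_a, results_b, n_resamples=10_000, seed=0):
    """Estimate how often model A's accuracy lead over model B survives
    resampling the benchmark's questions. results_a and results_b are
    per-question 0/1 correctness lists, aligned question-by-question."""
    rng = random.Random(seed)
    n = len(results_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample question indices with replacement and recompute the gap.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(results_a[i] - results_b[i] for i in idx) / n
        if diff > 0:
            wins += 1
    return wins / n_resamples  # fraction of resamples in which A beats B

# Made-up per-question scores on a hypothetical 50-question benchmark.
gen = random.Random(42)
model_a = [1 if gen.random() < 0.80 else 0 for _ in range(50)]
model_b = [1 if gen.random() < 0.74 else 0 for _ in range(50)]

p = paired_bootstrap(model_a, model_b)
print(f"A outperforms B in {p:.0%} of resamples")
```

If the ordering of two models flips in a large fraction of resamples, the headline difference between them may just be luck on the particular questions chosen, which is exactly the distinction the statistical-testing advocates want benchmark reports to make explicit.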
[5]
Oxford study finds AI benchmarks often exaggerate model performance
Nearly half of all examined benchmarks fail to clearly define their testing goals. A new study reveals that methodologies for evaluating AI systems often overstate performance and lack scientific rigor, raising questions about many benchmark results. Researchers at the Oxford Internet Institute, collaborating with over three dozen institutions, examined 445 leading AI tests, known as benchmarks. These benchmarks measure AI model performance across various topic areas. AI developers use these benchmarks to assess model capabilities and promote technical advancements. Claims about software engineering performance and abstract-reasoning capacity reference these evaluations. The paper, released Tuesday, suggests these fundamental tests may be unreliable. The study found that many top-tier benchmarks fail to define their testing objectives, reuse data and methods from existing benchmarks, and infrequently employ reliable statistical methods for comparing model results. Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author, stated that these benchmarks can be "alarmingly misleading." Mahdi told NBC News, "When we ask AI models to perform certain tasks, we often actually measure completely different concepts or constructs than what we aim to measure." Andrew Bean, another lead author, agreed that "even reputable benchmarks are too often blindly trusted and deserve more scrutiny." Bean also told NBC News, "You need to really take it with a grain of salt when you hear things like 'a model achieves Ph.D. level intelligence.' We're not sure that those measurements are being done especially well." Some benchmarks analyzed evaluate specific skills, such as Russian or Arabic language abilities. Others measure general capabilities like spatial reasoning and continual learning. A central concern for the authors was the "construct validity" of a benchmark, which questions if it accurately tests the real-world phenomenon it intends to measure. 
For instance, one benchmark reviewed in the study measures a model's performance on nine different tasks, including answering yes-or-no questions using information from Russian-language Wikipedia, instead of an endless series of questions to gauge Russian proficiency. Approximately half of the examined benchmarks do not clearly define the concepts they claim to measure. This casts doubt on their ability to provide useful information about the AI models under test. The study highlights Grade School Math 8K (GSM8K), a common AI benchmark for basic math questions. Leaderboards for GSM8K are often cited to show AI models' strong mathematical reasoning. The benchmark's documentation states it is "useful for probing the informal reasoning ability of large language models." However, Mahdi argued that correct answers on benchmarks like GSM8K do not necessarily indicate actual mathematical reasoning. He explained, "When you ask a first grader what two plus five equals and they say seven, yes, that's the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no." Bean acknowledged that measuring abstract concepts like reasoning involves evaluating a subset of tasks, and this selection will inherently be imperfect. He stated, "There are a lot of moving pieces in these evaluations, and satisfying all of them requires balance. But this paper calls for benchmarks to clearly define what they set out to measure." He added, "With concepts like harmlessness or reasoning, people oftentimes just throw the word around to pick something that falls near that category that they can measure and say, 'Great, now I've measured it.'" The new paper offers eight recommendations and a checklist to systematize benchmark criteria and enhance transparency and trust. 
Suggested improvements include specifying the scope of the evaluated action, constructing task batteries that better represent overall abilities, and comparing model performance using statistical analysis. Nikola Jurkovic, a member of the technical staff at the METR AI research center, praised the paper's contributions. Jurkovic told NBC News, "We need more rigor if we want to be able to interpret the results of AI benchmarks. This checklist is a starting point for researchers to check whether their benchmark will be insightful." Tuesday's paper builds on previous research that identified flaws in many AI benchmarks. Researchers from AI company Anthropic advocated for increased statistical testing last year. This testing would determine if a model's performance on a benchmark reflected actual capability differences or was a "lucky result" given the tasks and questions. Several research groups have recently proposed new test series to improve benchmark usefulness and accuracy. These new tests better measure models' real-world performance on economically relevant tasks. In late September, OpenAI launched a new series of tests evaluating AI's performance in 44 different occupations. These tests aim to ground AI capability claims more firmly in real-world scenarios. Examples include AI's ability to correct inconsistencies in customer invoices in Excel for a sales analyst role, or to create a full production schedule for a 60-second video shoot for a video producer role. Dan Hendrycks, director of the Center for AI Safety, and a research team recently released a similar real-world benchmark. This benchmark evaluates AI systems' performance on tasks necessary for automating remote work. Hendrycks told NBC News, "It's common for AI systems to score high on a benchmark but not actually solve the benchmark's actual goal." Mahdi concluded that researchers and developers have many avenues to explore in AI benchmark evaluation. 
He stated, "We are just at the very beginning of the scientific evaluation of AI systems."
A comprehensive Oxford study exposes critical flaws in AI benchmarking methods, finding that 84% of tests lack scientific rigor and many fail to accurately measure claimed capabilities like reasoning and safety.
A comprehensive study from researchers at the Oxford Internet Institute has exposed significant flaws in the methods used to evaluate artificial intelligence systems, raising serious questions about the reliability of benchmark results that underpin most claims about AI progress. The research, conducted in partnership with over three dozen institutions including Stanford University, UC Berkeley, and Yale University, examined 445 leading AI benchmarks and found that only 16 percent use rigorous scientific methods to compare model performance [1].
The findings suggest that many widely-cited AI capabilities may be significantly overstated. According to lead author Andrew Bean, "Benchmarks underpin nearly all claims about advances in AI. But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to" [1].

The study identified several critical problems with existing AI evaluation methods. Approximately half of the examined benchmarks claim to measure abstract concepts like reasoning or harmlessness without providing clear definitions of these terms or explaining how to measure them effectively [2]. This lack of clarity makes it difficult to determine whether AI models are actually demonstrating the capabilities they appear to possess.
A particularly concerning finding was that 27 percent of reviewed benchmarks rely on convenience sampling, where sample data is chosen for ease rather than using more rigorous methods like random or stratified sampling [1]. This approach can lead to misleading results that don't accurately reflect real-world performance.

The researchers highlighted the Grade School Math 8K (GSM8K) benchmark as an example of how tests can be misleading. While this benchmark is widely used to demonstrate AI models' mathematical reasoning abilities, the study authors argue that correct answers don't necessarily indicate genuine reasoning [4].

Adam Mahdi, a senior research fellow at Oxford and lead author, explained the problem using an analogy: "When you ask a first grader what two plus five equals and they say seven, yes, that's the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no" [4].

This issue is compounded by the problem of data contamination, where benchmark test questions may have been included in the model's training dataset, leading to memorization rather than genuine problem-solving ability [2].

The findings have significant implications for how AI companies market their products. When OpenAI released GPT-5 earlier this year, the company's promotional materials heavily emphasized benchmark scores from tests like AIME 2025, SWE-bench Verified, and MMMU, claiming achievements such as "94.6 percent on AIME 2025 without tools" and "84.2 percent on MMMU" [1].
However, the Oxford study suggests that such claims should be viewed with considerable skepticism. Bean cautioned that consumers and policymakers should "really take it with a grain of salt when you hear things like 'a model achieves Ph.D. level intelligence.' We're not sure that those measurements are being done especially well" [4].
The study's findings are particularly concerning given that these benchmarks are often used to make safety assessments and inform regulatory decisions [3]. With no clear regulation currently governing AI models, benchmark examinations serve as primary tools for evaluating everything from logic problem-solving to resistance to manipulation attempts.

Recent incidents underscore these concerns. Google recently withdrew its Gemma model after it made false allegations about a U.S. senator, and similar issues have occurred with other models that scored highly on benchmarks but failed when released to the public [3].

The research team has developed eight specific recommendations to improve benchmarking practices, including defining the phenomenon being measured, preparing for contamination, and using statistical methods to compare models [1]. They also created a comprehensive checklist that benchmarkers can use to evaluate the rigor of their own tests [5].
Some industry figures are already responding to these concerns. Greg Kamradt, president of the Arc Prize Foundation, announced "ARC Prize Verified, a program to increase the rigor of evaluating frontier systems on the ARC-AGI benchmark" on the same day the Oxford study was released [1].

Nikola Jurkovic from the METR AI research center praised the paper's contributions, stating that "We need more rigor if we want to be able to interpret the results of AI benchmarks. This checklist is a starting point for researchers to check whether their benchmark will be insightful" [5].