2 Sources
[1]
AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds
You know all of those reports about artificial intelligence models successfully passing the bar or achieving Ph.D.-level intelligence? Looks like we should start taking those degrees back. A new study from researchers at the Oxford Internet Institute suggests that most of the popular benchmarking tools used to test AI performance are often unreliable and misleading.

Researchers looked at 445 different benchmark tests used by the industry and other academic outfits to test everything from reasoning capabilities to performance on coding tasks. Experts reviewed each benchmarking approach and found indications that the results produced by these tests may not be as accurate as they have been presented, due in part to vague definitions of what a benchmark is attempting to test and a lack of disclosure of the statistical methods that would allow different models to be easily compared.

A big problem the researchers found is that "Many benchmarks are not valid measurements of their intended targets." That is to say, while a benchmark may claim to measure a specific skill, it may test for that skill in a way that doesn't actually capture a model's capability.

For example, the researchers point to the Grade School Math 8K (GSM8K) benchmark, which measures a model's performance on grade school-level word-based math problems designed to push the model into "multi-step mathematical reasoning." GSM8K is advertised as being "useful for probing the informal reasoning ability of large language models." But the researchers argue that the test doesn't necessarily tell you whether a model is engaging in reasoning. "When you ask a first grader what two plus five equals and they say seven, yes, that's the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no," Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author of the study, told NBC News.

In the study, the researchers pointed out that GSM8K scores have increased over time, which may point to models getting better at this kind of reasoning and performance. But it may also point to contamination, which happens when benchmark test questions make it into the model's training data and the model starts "memorizing" answers or information rather than reasoning its way to a solution. When researchers tested the same models on a new set of benchmark questions, they noticed "significant performance drops."

While this study is among the largest reviews of AI benchmarking, it's not the first to suggest this system of measurement may not be all that it's sold to be. Last year, researchers at Stanford analyzed several popular AI model benchmark tests and found "large quality differences between them, including those widely relied on by developers and policymakers," and noted that most benchmarks "are highest quality at the design stage and lowest quality at the implementation stage."

If nothing else, the research is a good reminder that these performance measures, while often well-intended and meant to provide an accurate analysis of a model, can be turned into little more than marketing speak for companies.
[2]
AI's capabilities may be exaggerated by flawed tests, according to new study
Researchers said that the methods used to evaluate AI are often lacking in rigor.

Researchers behind a new study say that the methods used to evaluate AI systems' capabilities routinely oversell AI performance and lack scientific rigor. The study, led by researchers at the Oxford Internet Institute in partnership with over three dozen researchers from other institutions, examined 445 leading AI tests, called benchmarks, often used to measure the performance of AI models across a variety of topic areas.

AI developers and researchers use these benchmarks to evaluate model abilities and tout technical progress, referencing them to make claims on topics ranging from software engineering performance to abstract-reasoning capacity. However, the paper, released Tuesday, claims these fundamental tests might not be reliable and calls into question the validity of many benchmark results.

According to the study, a significant number of top-tier benchmarks fail to define what exactly they aim to test, concerningly reuse data and testing methods from pre-existing benchmarks, and seldom use reliable statistical methods to compare results between models. Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author of the study, argued these benchmarks can be alarmingly misleading: "When we ask AI models to perform certain tasks, we often actually measure completely different concepts or constructs than what we aim to measure," Mahdi told NBC News.

Andrew Bean, a researcher at the Oxford Internet Institute and another lead author of the study, concurred that even reputable benchmarks are too often blindly trusted and deserve more scrutiny. "You need to really take it with a grain of salt when you hear things like 'a model achieves Ph.D. level intelligence,'" Bean told NBC News. "We're not sure that those measurements are being done especially well."

Some of the benchmarks examined in the analysis measure specific skills, like Russian or Arabic language abilities, while other benchmarks measure more general capabilities, like spatial reasoning and continual learning. A core issue for the authors was whether a benchmark is a good test of the real-world phenomenon it aims to measure, or what the authors label "construct validity." Instead of testing a model on an endless series of questions to evaluate its ability to speak Russian, for example, one benchmark reviewed in the study measures a model's performance on nine different tasks, like answering yes-or-no questions using information drawn from Russian-language Wikipedia. However, roughly half of the benchmarks examined in the study fail to clearly define the concepts they purport to measure, casting doubt on their ability to yield useful information about the AI models being tested.

As an example, the authors showcase a common AI benchmark called Grade School Math 8K (GSM8K), which measures performance on a set of basic math questions. Observers often point to leaderboards on the GSM8K benchmark to show that AI models are highly capable at fundamental mathematical reasoning, and the benchmark's documentation says it is "useful for probing the informal reasoning ability of large language models." Yet correct answers on benchmarks like GSM8K do not necessarily mean the model is actually engaging in mathematical reasoning, study author Mahdi said. "When you ask a first grader what two plus five equals and they say seven, yes, that's the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no."

Bean acknowledged that measuring nebulous concepts like reasoning requires evaluating a subset of tasks, and that such selection will invariably be imperfect. "There are a lot of moving pieces in these evaluations, and satisfying all of them requires balance. But this paper calls for benchmarks to clearly define what they set out to measure," he said. "With concepts like harmlessness or reasoning, people oftentimes just throw the word around to pick something that falls near that category that they can measure and say, 'Great, now I've measured it,'" Bean added.

In the new paper, the authors make eight recommendations and provide a checklist to systematize benchmark criteria and improve transparency and trust in benchmarks. The suggested improvements include specifying the scope of the particular ability being evaluated, constructing batteries of tasks that better represent the overall abilities being measured, and comparing models' performance via statistical analysis. Nikola Jurkovic, a member of technical staff at the influential METR AI research center, commended the paper's contributions. "We need more rigor if we want to be able to interpret the results of AI benchmarks. This checklist is a starting point for researchers to check whether their benchmark will be insightful," Jurkovic told NBC News.

Tuesday's paper builds on previous research pointing out flaws in many AI benchmarks. Last year, researchers from AI company Anthropic advocated for increased statistical testing to determine whether a model's performance on a specific benchmark really showed a difference in capabilities or was just a lucky result given the tasks and questions included in the benchmark.

To attempt to increase the usefulness and accuracy of benchmarks, several research groups have recently proposed new series of tests that better measure models' real-world performance on economically meaningful tasks. In late September, OpenAI released a new series of tests that evaluate AI's performance on tasks required for 44 different occupations, in an attempt to better ground claims of AI capabilities in the real world. For example, the tests measure AI's ability to fix inconsistencies in customer invoice Excel spreadsheets for an imaginary sales analyst role, or its ability to create a full production schedule for a 60-second video shoot for an imaginary video producer. Dan Hendrycks, director of the Center for AI Safety, and a team of researchers recently released a similar real-world benchmark designed to evaluate AI systems' performance on a range of tasks necessary for the automation of remote work. "It's common for AI systems to score high on a benchmark but not actually solve the benchmark's actual goal," Hendrycks told NBC News.

Surveying the broader landscape of AI benchmarks, Mahdi said researchers and developers have many exciting avenues to explore. "We are just at the very beginning of the scientific evaluation of AI systems," Mahdi said.
Oxford researchers analyzed 445 AI benchmarks and found significant flaws in testing methods, suggesting AI capabilities may be overhyped. The study reveals that popular tests lack scientific rigor and may not accurately measure what they claim to test.
A comprehensive new study led by researchers at the Oxford Internet Institute has cast serious doubt on the reliability of AI performance benchmarks that have been used to make bold claims about artificial intelligence capabilities. The research, conducted in partnership with over three dozen researchers from other institutions, represents one of the largest reviews of AI benchmarking to date, examining 445 leading AI tests across various domains [1][2].

The study analyzed benchmarks used to test everything from reasoning capabilities to coding performance, with experts reviewing each benchmarking approach to assess its validity and reliability. These benchmarks are routinely used by AI developers and researchers to evaluate model abilities and make public claims about technical progress, often referenced to support assertions about software engineering performance and abstract-reasoning capacity [2].

The research revealed fundamental flaws in how AI capabilities are measured and reported. According to the study, many benchmarks suffer from what researchers call poor "construct validity": they fail to accurately test the real-world phenomenon they claim to measure. Adam Mahdi, a senior research fellow at the Oxford Internet Institute and lead author, explained that "when we ask AI models to perform certain tasks, we often actually measure completely different concepts or constructs than what we aim to measure" [2].
A significant finding was that roughly half of the benchmarks examined fail to clearly define the concepts they purport to measure, casting doubt on their ability to yield useful information about AI model performance. Additionally, many benchmarks reuse data and testing methods from pre-existing benchmarks and seldom employ reliable statistical methods to compare results between different models [2].

The researchers highlighted the Grade School Math 8K (GSM8K) benchmark as a prime example of potentially misleading testing. This widely used benchmark measures AI performance on grade school-level word-based math problems and is advertised as being "useful for probing the informal reasoning ability of large language models." However, the study argues that correct answers don't necessarily indicate genuine mathematical reasoning [1].

Mahdi illustrated this concern with an analogy: "When you ask a first grader what two plus five equals and they say seven, yes, that's the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no" [1].
The study noted that while GSM8K scores have increased over time, this improvement may not reflect better reasoning capabilities but rather contamination, which occurs when benchmark test questions make it into a model's training dataset, leading to memorization rather than genuine problem-solving. When researchers tested models on new benchmark questions, they observed "significant performance drops" [1].
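To make the contamination issue concrete, here is a minimal sketch of one common heuristic for auditing it: flagging benchmark questions whose word n-grams overlap heavily with a training corpus. This is an illustration only, not a method from the Oxford study; the function names, the 13-gram window, and the 0.3 threshold are assumptions chosen for the example.

```python
# Illustrative contamination check (hypothetical names and thresholds):
# flag benchmark items whose word n-grams overlap heavily with training text.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(item: str, corpus_ngrams: set[tuple[str, ...]], n: int = 13) -> float:
    """Fraction of the item's n-grams that also appear in the training corpus."""
    item_ngrams = ngrams(item, n)
    if not item_ngrams:
        return 0.0
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)

def flag_contaminated(benchmark_items: list[str], training_texts: list[str],
                      n: int = 13, threshold: float = 0.3) -> list[int]:
    """Return indices of benchmark items with suspiciously high overlap."""
    corpus_ngrams: set[tuple[str, ...]] = set()
    for doc in training_texts:
        corpus_ngrams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items)
            if overlap_ratio(item, corpus_ngrams, n) >= threshold]
```

Real contamination audits are considerably more involved, scanning enormous corpora and using fuzzy or near-duplicate matching, but the basic idea of measuring overlap between test items and training data is the same.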
The findings have significant implications for how AI capabilities are communicated to the public and policymakers. Andrew Bean, another lead author from the Oxford Internet Institute, cautioned that claims about AI achieving advanced levels of intelligence should be viewed skeptically. "You need to really take it with a grain of salt when you hear things like 'a model achieves Ph.D. level intelligence.' We're not sure that those measurements are being done especially well," Bean told NBC News [2].

This research builds on previous concerns about AI benchmarking. Last year, Stanford researchers analyzed several popular AI model benchmark tests and found "large quality differences between them, including those widely relied on by developers and policymakers," noting that most benchmarks "are highest quality at the design stage and lowest quality at the implementation stage" [1].
To address these issues, the Oxford study provides eight specific recommendations and a checklist to systematize benchmark criteria and improve transparency. The suggested improvements include specifying the scope of the particular ability being evaluated, constructing batteries of tasks that better represent the overall abilities being measured, and comparing models' performance through rigorous statistical analysis [2].
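The last of those recommendations, statistical comparison of models, can be sketched in a few lines of code. The example below is not from the paper; it shows one standard approach, a paired bootstrap over per-question scores that estimates a confidence interval for the accuracy gap between two models. The function name, sample sizes, and scores are hypothetical.

```python
import random

def paired_bootstrap_diff(scores_a: list[int], scores_b: list[int],
                          n_resamples: int = 10_000, seed: int = 0):
    """Accuracy difference (A - B) with a 95% bootstrap confidence interval.

    scores_a and scores_b are 0/1 correctness scores for the same benchmark
    questions, so resampling is paired over question indices.
    """
    assert len(scores_a) == len(scores_b), "models must be scored on the same questions"
    rng = random.Random(seed)
    n = len(scores_a)
    observed = (sum(scores_a) - sum(scores_b)) / n
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample questions with replacement
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    return observed, diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)]

# If the interval excludes zero, the gap is unlikely to be a lucky draw of
# questions; if it straddles zero, the leaderboard difference may not reflect
# a real capability difference.
gap, low, high = paired_bootstrap_diff([1, 1, 0, 1, 0, 1, 1, 0],
                                       [1, 0, 0, 1, 0, 1, 0, 0])
print(f"accuracy gap: {gap:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
```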
Summarized by Navi