Curated by THEOUTPOST
On Mon, 21 Apr, 4:01 PM UTC
3 Sources
[1]
OpenAI's o3 AI model scores lower on a benchmark than the company initially implied | TechCrunch
A discrepancy between first- and third-party benchmark results for OpenAI's o3 AI model is raising questions about the company's transparency and model testing practices.

When OpenAI unveiled o3 in December, the company claimed the model could answer just over a fourth of questions on FrontierMath, a challenging set of math problems. That score blew the competition away -- the next-best model managed to answer only around 2% of FrontierMath problems correctly. "Today, all offerings out there have less than 2% [on FrontierMath]," Mark Chen, chief research officer at OpenAI, said during a livestream. "We're seeing [internally], with o3 in aggressive test-time compute settings, we're able to get over 25%."

As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing behind it than the model OpenAI publicly launched last week. Epoch AI, the research institute behind FrontierMath, released results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI's highest claimed score.

"OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini. We evaluated the new models on our suite of math and science benchmarks. Results in thread! pic.twitter.com/5gbtzkEy1B" -- Epoch AI (@EpochAIResearch), April 18, 2025

That doesn't mean OpenAI lied, per se. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted its testing setup likely differs from OpenAI's, and that it used an updated release of FrontierMath for its evaluations.

"The difference between our results and OpenAI's might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private)," wrote Epoch.

According to a post on X from the ARC Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model "is a different model [...] tuned for chat/product use," corroborating Epoch's report. "All released o3 compute tiers are smaller than the version we [benchmarked]," wrote ARC Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.

Granted, the fact that the public release of o3 falls short of OpenAI's testing promises is a bit of a moot point, since the company's o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks. It is, however, another reminder that AI benchmarks are best not taken at face value -- particularly when the source is a company with services to sell.

Benchmarking "controversies" are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models. In January, Epoch was criticized for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren't informed of OpenAI's involvement until it was made public. More recently, Elon Musk's xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.
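Epoch's quoted explanation points to two separate dials that can move a headline score: how much test-time compute the model gets per problem, and which problem subset is graded. The Python sketch below is purely illustrative -- the problem sets, solve probabilities, and function names are hypothetical stand-ins, not Epoch's or OpenAI's actual harness -- but it shows how both dials shift the reported percentage for the very same underlying model:

```python
import random

# Hypothetical stand-ins for two benchmark releases. Each problem carries a
# per-attempt solve probability; the larger "private" set adds harder problems.
OLD_SET = [{"id": i, "p_solve": 0.25} for i in range(180)]
NEW_SET = OLD_SET + [{"id": 180 + i, "p_solve": 0.02} for i in range(110)]


def solved_once(problem, rng):
    """One model attempt: succeeds with the problem's per-attempt probability."""
    return rng.random() < problem["p_solve"]


def benchmark_score(problems, attempts, seed=0):
    """Fraction of problems solved when the harness allows `attempts` tries each."""
    rng = random.Random(seed)
    hits = sum(any(solved_once(p, rng) for _ in range(attempts)) for p in problems)
    return hits / len(problems)


# Same "model", different reported numbers:
print(f"smaller subset, 1 attempt : {benchmark_score(OLD_SET, 1):.0%}")
print(f"larger subset,  1 attempt : {benchmark_score(NEW_SET, 1):.0%}")
print(f"smaller subset, 4 attempts: {benchmark_score(OLD_SET, 4):.0%}")
```

Nothing in this sketch reproduces either party's methodology; the point is only that the evaluated subset and the per-problem compute budget are both levers that change the number that ends up in a chart.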
[2]
OpenAI's o3 claimed 25%, independent test says "try 10"
OpenAI's o3 AI model scored lower on the FrontierMath benchmark than the company initially implied, according to independent tests by Epoch AI, the research institute behind FrontierMath. When OpenAI unveiled o3 in December, it claimed the model could answer 25% of FrontierMath questions, significantly outperforming other models. Epoch AI's tests found that o3 scored around 10% on FrontierMath.

The discrepancy may be due to differences in testing setups or the version of o3 used. OpenAI's chief research officer, Mark Chen, had stated that o3 achieved over 25% in "aggressive test-time compute settings." Epoch noted that OpenAI's published benchmark results showed a lower-bound score that matches the 10% score Epoch observed. The public o3 model is "tuned for chat/product use" and has smaller compute tiers than the version tested in December, according to the ARC Prize Foundation, which benchmarked a pre-release version of o3. OpenAI's Wenda Zhou explained that the production o3 model is "more optimized for real-world use cases" and speed, which may result in benchmark disparities.

OpenAI's o3-mini-high and o4-mini models outperform o3 on FrontierMath. The company plans to release a more powerful o3 variant, o3-pro, in the coming weeks. This incident highlights the need for caution when interpreting AI benchmarks, particularly when they are used to promote commercial products.

The AI industry has seen several benchmarking controversies recently. In January, Epoch was criticized for not disclosing funding from OpenAI until after the company announced o3. xAI was accused of publishing misleading benchmark charts for its Grok 3 model, and Meta admitted to touting benchmark scores for a different version of a model than the one available to developers.
[3]
OpenAI's o3 Scores Less Than Half of Claimed Score in FrontierMath Test
OpenAI told ARC Prize that the released o3 model is different from the one tested by the organisation.

OpenAI's o3 artificial intelligence (AI) model, which was released last week, is underperforming on a specific benchmark. Epoch AI, the company behind the FrontierMath benchmark, highlighted that the publicly available version of the o3 AI model scored 10 percent on the test, a much lower value than the company's claim at launch. The San Francisco-based AI firm's chief research officer, Mark Chen, had said that the model scored 25 percent on the test, creating a new record. However, the discrepancy does not mean that OpenAI lied about the metric.

In December 2024, OpenAI held a livestream on YouTube and other social media platforms announcing the o3 AI model. At the time, the company highlighted the improved set of capabilities in the large language model (LLM), in particular its improved performance on reasoning-based queries. One of the ways the company exemplified the claim was by sharing the model's benchmark scores across different popular tests.

One of these tests was FrontierMath, created by Epoch AI. The mathematical test is known for being challenging and tamper-proof, as more than 70 mathematicians developed it and the problems are all new and unpublished. Notably, until December, no AI model had solved more than nine percent of the questions in a single attempt. However, at the time of launch, Chen claimed that o3 was able to set a new record by scoring 25 percent on the test. External verification of the performance was not possible at the time, as the model was not available in the public domain.

After o3 and o4-mini were launched last week, Epoch AI made a post on X (formerly known as Twitter) claiming that the o3 model in fact scored 10 percent on the test. While a score of 10 percent still makes the AI model the highest ranking on the test, the number is less than half of what the company claimed. The post has led to several AI enthusiasts questioning the validity of the benchmark scores.

The discrepancy does not mean that OpenAI lied about the performance of its AI model. Instead, the AI firm's unreleased model likely used higher compute to get that score, while the commercial version was likely fine-tuned to be more efficient, and in that process some of its performance was toned down.

Separately, ARC Prize, the organisation behind the ARC-AGI benchmark test, which tests an AI model's general intelligence, also posted on X about the discrepancy. The post confirmed, "The released o3 is a different model from what we tested in December 2024." The organisation said that the released o3 model's compute tiers are smaller than the version it tested. However, it did confirm that o3 was not trained on ARC-AGI data, even at the pre-training stage. ARC Prize said that it will re-test the released o3 AI model and publish the updated results. It will also re-test the o4-mini model and label the prior scores as "preview". It is not certain whether the released version of o3 will underperform on this test as well.
OpenAI's o3 AI model scores 10% on the FrontierMath benchmark, significantly lower than the 25% initially claimed. The discrepancy raises questions about AI benchmark transparency and testing practices in the industry.
OpenAI's recently released o3 AI model has scored significantly lower on the FrontierMath benchmark than initially claimed, sparking discussions about transparency and testing practices in the AI industry. Epoch AI, the research institute behind FrontierMath, reported that the publicly available o3 model achieved a score of around 10% on their challenging mathematical test, far below the 25% OpenAI had suggested during the model's unveiling in December [1].
The discrepancy between OpenAI's claim and Epoch AI's findings can be attributed to several factors:
Different model versions: The ARC Prize Foundation confirmed that the released o3 model differs from the version tested in December 2024, being "tuned for chat/product use" [2].
Compute settings: OpenAI's chief research officer, Mark Chen, had mentioned achieving over 25% in "aggressive test-time compute settings," suggesting that the publicly released model may use less computational power [1] (see the sketch after this list).
Testing methodology: Epoch AI noted that their testing setup likely differs from OpenAI's, and they used an updated version of FrontierMath for their evaluations [1].
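To make the "compute settings" factor concrete: aggressive test-time compute typically means sampling many candidate solutions per problem and aggregating them, for example by majority vote (self-consistency). The sketch below is a hypothetical illustration of that general technique, not OpenAI's actual setup; the answers and probabilities are invented for demonstration only.

```python
from collections import Counter
import random


def sample_answer(rng):
    """Stand-in for one sampled solution to a single problem. The correct
    answer ("42") comes up only 40% of the time, so one sample is usually wrong."""
    return "42" if rng.random() < 0.4 else rng.choice(["17", "99", "3"])


def answer_with_budget(n_samples, seed=0):
    """More test-time compute: draw many samples and keep the most common answer."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]


# A single sample is often wrong; a large vote converges on the answer the
# model produces most consistently, lifting the measured score.
print("budget of 1 sample  :", answer_with_budget(1))
print("budget of 64 samples:", answer_with_budget(64))
```

This is why a compute-heavy internal configuration and a cheaper production configuration of the same model can legitimately report very different benchmark numbers.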
This incident highlights broader issues within the AI industry:
Benchmark reliability: The discrepancy underscores the need for caution when interpreting AI benchmarks, especially those used to promote commercial products [2].
Transparency concerns: The AI community has raised questions about OpenAI's transparency and model testing practices [1].
Recurring controversies: Similar benchmarking controversies have recently affected other AI companies, including xAI and Meta, suggesting an industry-wide issue [2].
While the lower score on FrontierMath may seem disappointing, OpenAI has addressed the situation:
Wenda Zhou from OpenAI explained that the production o3 model is "more optimized for real-world use cases" and speed, which may result in benchmark disparities [2].
OpenAI plans to release a more powerful variant, o3-pro, in the coming weeks, which may achieve higher benchmark scores [1].
Interestingly, OpenAI's o3-mini-high and o4-mini models already outperform o3 on FrontierMath, suggesting continued progress in the company's AI development [1].
FrontierMath has gained significance in the AI community due to its challenging nature:
The test is considered tamper-proof, developed by over 70 mathematicians with new, unpublished problems [3].
Prior to o3, no AI model had solved more than 9% of FrontierMath questions in a single attempt [3].
Despite the lower-than-claimed score, o3's 10% achievement still represents a notable advancement in AI reasoning capabilities [3].
As the AI industry continues to evolve rapidly, this incident serves as a reminder of the complexities involved in benchmarking and the importance of transparent communication about AI model capabilities and limitations.
References
[1] OpenAI's o3 AI model scores lower on a benchmark than the company initially implied | TechCrunch
[2] OpenAI's o3 claimed 25%, independent test says "try 10"
[3] OpenAI's o3 Scores Less Than Half of Claimed Score in FrontierMath Test