Curated by THEOUTPOST
On Mon, 20 Jan, 12:01 AM UTC
4 Sources
[1]
Did OpenAI Cheat on Its Big Math Test? - Decrypt
How intelligent is a model that memorizes the answers before an exam? That's the question facing OpenAI after it unveiled o3 in December and touted the model's impressive benchmark scores. At the time, some pundits hailed it as being almost as powerful as AGI, the level at which artificial intelligence can match human performance on any task a user requires. But money changes everything -- even math tests, apparently.

OpenAI's victory lap over its o3 model's stunning 25.2% score on FrontierMath, a challenging mathematical benchmark developed by Epoch AI, hit a snag when it turned out the company wasn't just acing the test -- OpenAI helped write it, too.

"We gratefully acknowledge OpenAI for their support in creating the benchmark," Epoch AI wrote in an updated footnote on the FrontierMath whitepaper -- and that was enough to raise red flags among enthusiasts. Worse, OpenAI had not only funded FrontierMath's development but also had access to its problems and solutions to use as it saw fit. Epoch AI later revealed that OpenAI hired the company to provide 300 math problems, as well as their solutions. "As is typical of commissioned work, OpenAI retains ownership of these questions and has access to the problems and solutions," Epoch said Thursday.

Neither OpenAI nor Epoch replied to a request for comment from Decrypt. Epoch has, however, said that OpenAI signed a contract in advance indicating it would not use the questions and answers in its database to train its o3 model. The Information first broke the story.

While an OpenAI spokesperson maintains that OpenAI didn't directly train o3 on the benchmark and that the problems were "strongly held out" (meaning OpenAI didn't have access to some of them), experts note that access to the test materials could still allow performance to be optimized through iterative adjustments.

Tamay Besiroglu, associate director at Epoch AI, said that OpenAI had initially demanded that its financial relationship with Epoch not be revealed. "We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible," he wrote in a post. "Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much, but not all of the dataset."

Besiroglu said that OpenAI had promised not to use Epoch AI's problems and solutions, but did not sign a legal contract making that commitment enforceable. "We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions," he wrote. "However, we have a verbal agreement that these materials will not be used in model training."

Fishy as it sounds, Elliot Glazer, Epoch AI's lead mathematician, said he believes OpenAI was true to its word: "My personal opinion is that OAI's score is legit (i.e., they didn't train on the dataset), and that they have no incentive to lie about internal benchmarking performances," he posted on Reddit. The researcher also took to Twitter to address the situation, sharing a link to a debate about the issue on the forum LessWrong.

MMLU is a synthetic benchmark, like FrontierMath, created to measure how well models handle a wide range of tasks; GSM8K is a set of grade-school math problems used to benchmark how proficient LLMs are at math.
When benchmark questions like these end up in a model's training data, it becomes impossible to properly assess how powerful or accurate the model truly is. It's like giving a student with a photographic memory a list of the problems and solutions that will be on their next exam: did they reason their way to a solution, or simply spit back the memorized answer? Since these tests are intended to demonstrate that AI models are capable of reasoning, you can see what the fuss is about.

"It's actually A VERY BIG ISSUE," RemBrain founder Vasily Morzhakov warned. "The models are tested in their instruction versions on MMLU and GSM8K tests. But the fact that base models can regenerate tests -- it means those tests are already in pre-training."

Going forward, Epoch said it plans to implement a "hold out set" of 50 randomly selected problems that will be withheld from OpenAI so results can be verified independently. But the challenge of creating truly independent evaluations remains significant. Computer scientist Dirk Roeckmann argued that ideal testing would require "a neutral sandbox which is not easy to realize," adding that even then, there's a risk of "leaking of test data by adversarial humans."
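The mechanics of such a hold-out set are straightforward to illustrate. The sketch below is a hypothetical Python example, not Epoch AI's actual tooling; the split_holdout helper and the problem IDs are invented for illustration. It simply reserves a random subset of problems that the model developer never sees, so scores on that subset can serve as a contamination-free check.

```python
import random

# Minimal sketch, not Epoch AI's actual tooling or data format: it only
# illustrates the idea of a "hold-out set" -- randomly reserving problems
# (50 here, mirroring the figure Epoch cites) that the funder never sees,
# so scores on that subset can act as a contamination-free check.

def split_holdout(problems, holdout_size=50, seed=0):
    """Randomly partition benchmark problems into a shared set and a hold-out set."""
    rng = random.Random(seed)
    shuffled = list(problems)   # copy so the caller's ordering is untouched
    rng.shuffle(shuffled)
    holdout = shuffled[:holdout_size]   # withheld from the model developer
    shared = shuffled[holdout_size:]    # problems the developer may access
    return shared, holdout

# Hypothetical problem IDs standing in for FrontierMath items.
all_problems = [f"problem_{i:03d}" for i in range(300)]
shared, holdout = split_holdout(all_problems)
print(f"{len(shared)} shared problems, {len(holdout)} held out")
```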
[2]
OpenAI Faces Scrutiny Over o3 Model's FrontierMath Benchmarking Transparency
AI researchers have put OpenAI in the spotlight over its newest model, o3, following its unprecedented performance on the FrontierMath benchmark. While OpenAI recently reported roughly 25% accuracy on this particularly difficult mathematics benchmark, questions about transparency and data access are now being raised.

Epoch AI's FrontierMath benchmark challenges LLMs with highly complex mathematical problems. The benchmark has been criticized because OpenAI, which provided technical advice for the project, had access to key datasets before most participants. This raises the question of whether OpenAI's results reflect genuine progress in developing its models or whether the company benefited from prior exposure to the data.

Epoch AI's associate director, Tamay Besiroglu, acknowledged the concern but said that under the terms of the agreement with OpenAI, Epoch could not reveal all of the details. Six mathematicians involved in FrontierMath said they regretted participating without knowing the extent of OpenAI's access. Even though an unseen hold-out sample exists for the evaluation, specialists doubt the process is fair.
[3]
OpenAI Just Pulled a Theranos With o3
OpenAI's o3 benchmark controversy is starting to look like a Theranos moment -- claiming record-breaking performance on Epoch AI's FrontierMath benchmark while having access to much of the test data and funding the benchmark itself. Epoch AI's associate director, Tamay Besiroglu, admitted they were contractually restricted from disclosing OpenAI's involvement, while six contributing mathematicians revealed they were unaware of the exclusive access.

Besiroglu said, "We made a mistake in not being more transparent about OpenAI's involvement." He revealed that the company was restricted from disclosing the partnership until the o3 model was launched. "Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset. We own this error and are committed to doing better in the future," he added.

Besiroglu also acknowledged that OpenAI had access to a large portion of the FrontierMath problems and solutions. However, an 'unseen-by-OpenAI hold-out set' helped verify the model's capabilities.

"Six mathematicians who significantly contributed to the FrontierMath benchmark confirmed this is true - that they are unaware that OpenAI will have exclusive access to this benchmark (and others won't). Most express they are not sure they would have contributed had they known," revealed Carina Hong, a PhD candidate at Stanford, on X.

AI experts like Gary Marcus are questioning the legitimacy of OpenAI's claims, comparing the situation directly to Theranos.

In December last year, when OpenAI announced its new o3 family of models, the company claimed that o3 achieved an impressive 25% accuracy on the Epoch AI FrontierMath benchmark. It was a huge leap over the previous high scores of just 2% from other powerful models. The benchmark tasks LLMs with solving mathematical problems of unprecedented difficulty.

In an exclusive interaction with AIM earlier, Besiroglu revealed that Epoch AI significantly reduces data contamination issues by producing novel problems for the benchmark. He also said, "The [benchmark] data is private, so it's not used for training."

A user on LessWrong discovered that the latest version of FrontierMath's research paper explaining the benchmark included a footnote stating, "We gratefully acknowledge OpenAI for their support in creating the benchmark."

Mikhail Samin, executive director at the AI Governance and Safety Institute, said on X that "OpenAI has a history of misleading behaviour -- from deceiving its own board to secret non-disparagement agreements that former employees had to sign -- so I guess this shouldn't be too surprising."

OpenAI also claimed the o3 model scored almost 90% on the ARC-AGI benchmark, exceeding human performance. The benchmark is said to be the "only AI benchmark that measures progress towards general intelligence." However, François Chollet, creator of the ARC-AGI benchmark, stated, "I don't believe this is AGI -- there are still easy ARC-AGI-1 tasks that o3 can't solve."

Since the model's launch, Marcus has been sceptical of the results. Earlier, he also said, "Not one person outside of OpenAI has evaluated o3's robustness across different types of problems."
[4]
AI benchmarking organization criticized for waiting to disclose funding from OpenAI | TechCrunch
An organization developing math benchmarks for AI didn't disclose that it had received funding from OpenAI until relatively recently, drawing allegations of impropriety from some in the AI community.

Epoch AI, a nonprofit primarily funded by Open Philanthropy, a research and grantmaking foundation, revealed on December 20 that OpenAI had supported the creation of FrontierMath. FrontierMath, a test with expert-level problems designed to measure an AI's mathematical skills, was one of the benchmarks OpenAI used to demo its upcoming flagship AI, o3.

In a post on the forum LessWrong, a contractor for Epoch AI going by the username "Meemi" says that many contributors to the FrontierMath benchmark weren't informed of OpenAI's involvement until it was made public. "The communication about this has been non-transparent," Meemi wrote. "In my view Epoch AI should have disclosed OpenAI funding, and contractors should have transparent information about the potential of their work being used for capabilities, when choosing whether to work on a benchmark."

On social media, some users raised concerns that the secrecy could erode FrontierMath's reputation as an objective benchmark. In addition to backing FrontierMath, OpenAI had access to many of the problems and solutions in the benchmark -- a fact Epoch AI didn't divulge prior to December 20, when o3 was announced.

In a reply to Meemi's post, Tamay Besiroglu, associate director of Epoch AI and one of the organization's co-founders, asserted that the integrity of FrontierMath hadn't been compromised, but admitted that Epoch AI "made a mistake" in not being more transparent. "We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible," Besiroglu wrote. "Our mathematicians deserved to know who might have access to their work. Even though we were contractually limited in what we could say, we should have made transparency with our contributors a non-negotiable part of our agreement with OpenAI."

Besiroglu added that while OpenAI has access to FrontierMath, it has a "verbal agreement" with Epoch AI not to use FrontierMath's problem set to train its AI. (Training an AI on FrontierMath would be akin to teaching to the test.) Epoch AI also has a "separate holdout set" that serves as an additional safeguard for independent verification of FrontierMath benchmark results, Besiroglu said. "OpenAI has ... been fully supportive of our decision to maintain a separate, unseen holdout set," Besiroglu wrote.

However, muddying the waters, Epoch AI lead mathematician Elliot Glazer noted in a post on Reddit that Epoch AI hasn't been able to independently verify OpenAI's FrontierMath o3 results. "My personal opinion is that [OpenAI's] score is legit (i.e., they didn't train on the dataset), and that they have no incentive to lie about internal benchmarking performances," Glazer said. "However, we can't vouch for them until our independent evaluation is complete."

The saga is yet another example of the challenge of developing empirical benchmarks to evaluate AI -- and of securing the necessary resources for benchmark development without creating the perception of conflicts of interest.
OpenAI's impressive performance on the FrontierMath benchmark with its o3 model is under scrutiny due to the company's involvement in creating the test and having access to problem sets, raising questions about the validity of the results and the transparency of AI benchmarking.
OpenAI recently unveiled its o3 model, claiming an impressive 25.2% accuracy on the FrontierMath benchmark, a challenging mathematical test developed by Epoch AI [1]. This score far surpassed previous high scores of just 2% from other powerful models, marking a significant leap in AI capabilities [3].

The celebration of o3's performance was short-lived as it came to light that OpenAI had played a significant role in the creation of the FrontierMath benchmark. Epoch AI, the nonprofit behind FrontierMath, revealed that OpenAI had funded the benchmark's development and had access to a large portion of the problems and solutions [1].

This disclosure raised concerns about the validity of OpenAI's results and the transparency of the benchmarking process. Tamay Besiroglu, associate director at Epoch AI, admitted that they were contractually restricted from disclosing OpenAI's involvement until the o3 model was launched [3].

The lack of transparency extended to the mathematicians who contributed to FrontierMath. Six mathematicians confirmed they were unaware that OpenAI would have exclusive access to the benchmark [3]. This revelation led to regret among some contributors who might not have participated had they known about OpenAI's involvement [2].

OpenAI maintains that it didn't directly train o3 on the benchmark and that some problems were "strongly held out" [1]. Epoch AI acknowledged the mistake in not being more transparent about OpenAI's involvement and committed to implementing a "hold out set" of 50 randomly selected problems to be withheld from OpenAI for future testing [1][4].

This controversy highlights the challenges in creating truly independent evaluations for AI models. Experts argue that ideal testing would require a neutral sandbox, which is difficult to realize [1]. The incident has drawn comparisons to the Theranos scandal, with some AI experts questioning the legitimacy of OpenAI's claims [3].

The FrontierMath controversy underscores the complexities of AI benchmarking and the need for greater transparency in the development and testing of AI models. It raises important questions about how to balance the need for resources in benchmark development with maintaining the integrity and objectivity of the evaluation process [4].