Curated by THEOUTPOST
On Tue, 12 Nov, 12:03 AM UTC
8 Sources
[1]
A new math benchmark just dropped and leading AI models can solve 'less than 2%' of its problems... oh dear
Sometimes I forget there's a whole other world out there where AI models aren't just used for basic tasks such as simple research and quick content summaries. Out in the land of bigwigs, they're instead being used to help with everything from financial analysis to scientific research. That's why their mathematical capabilities are so important -- plus it's a general marker of reasoning capabilities. Which is why mathematical benchmarks exist. Benchmarks such as FrontierMath, which its maker, Epoch AI, has just dropped and which is putting LLMs through their paces with "hundreds of original, expert-crafted mathematics problems designed to evaluate advanced reasoning capabilities in AI systems" (via Ars Technica). While today's AI models don't tend to struggle with other mathematical benchmarks such as GSM-8k and MATH, according to Epoch AI, "they solve less than 2% of FrontierMath problems, revealing a substantial gap between current AI capabilities and the collective prowess of the mathematics community". To be clear, these are hard problems. As in, so hard that they "typically require hours or days for expert mathematicians to solve", ranging "from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory". What's so different about this benchmark is that solving these mathematical problems requires "extended chains of precise reasoning, with each step building exactly on what came before". AI models have traditionally not been great at extended reasoning in general, let alone for super-advanced math. This makes sense when you consider what AI models, at bottom, are doing. Using LLMs as an example, these are trained on tons of data to figure out what each next word would most likely be based on this data. Although of course there's plenty of room for directing the model more towards different words, the process is essentially probabilistic. 
Of late, however, we've seen AI models apply their probabilistic "thinking" in more of a directed fashion towards intermediary steps of this "thinking". In other words, we've seen a move towards AI models that attempt to reason through their thinking, rather than just jumping to a probabilistic conclusion. There's now a version of ChatGPT-4o, for instance, that uses reasoning (and you better make sure you don't question it). It's also telling that you can now potentially be awarded for giving a question that AI can't answer for "humanity's last exam". Of course, these individual steps of reasoning might themselves be arrived at probabilistically -- and could we expect any more from a non-sentient algorithm? -- but they do seem to be engaging in what we flesh-and-bloodies after the fact consider to be "reasoning". We're clearly a way off from having these AI models achieve the reasoning capabilities of our best and brightest, though. We can see that now that we have a mathematical benchmark capable of really putting them to the test -- 2% isn't great, is it? (And take that, robots.) Regarding the FrontierMath problems, Fields Medalist Terence Tao tells Epoch AI, "I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages..." While AI models might not be able to crack these difficult problems just yet, the FrontierMath benchmark looks to serve as a good litmus test for future improvements, ensuring the models aren't just spewing out mathematical nonsense that only experts could verify as such. We must, in the end, remember that AI is not truth-aiming, however closely we humans aim its probabilistic reasoning at results that tend towards the truth. 
The philosopher in me must ask: Without it having an inner life aiming towards truth, can truth actually exist for the AI, even if it spews it out? Truth for us, yes, but for the AI? I suspect not, and that's why benchmarks like these will be crucial moving forwards into this new industrial revolution, or whatever they're calling it these days.
[2]
New secret math benchmark stumps AI models and PhDs alike
On Friday, research organization Epoch AI released FrontierMath, a new mathematics benchmark that has been turning heads in the AI world because it contains hundreds of expert-level problems that leading AI models solve less than 2 percent of the time, according to Epoch AI. The benchmark tests AI language models (such as GPT-4o, which powers ChatGPT) against original mathematics problems that typically require hours or days for specialist mathematicians to complete. FrontierMath's performance results, revealed in a preprint research paper, paint a stark picture of current AI model limitations. Even with access to Python environments for testing and verification, top models like Claude 3.5 Sonnet, GPT-4o, o1-preview, and Gemini 1.5 Pro scored extremely poorly. This contrasts with their high performance on simpler math benchmarks -- many models now score above 90 percent on tests like GSM8K and MATH. The design of FrontierMath differs from many existing AI benchmarks because the problem set remains private and unpublished to prevent data contamination. Many existing AI models are trained on other test problem datasets, allowing the AI models to easily solve the problems and appear more generally capable than they actually are. Many experts cite this as evidence that current large language models (LLMs) are poor generalist learners. Epoch AI says it developed FrontierMath through collaboration with over 60 mathematicians from leading institutions. The problems underwent peer review to verify correctness and check for ambiguities. About 1 in 20 problems needed corrections during the review process, a rate comparable to other major machine learning benchmarks. The problems in the new set span multiple mathematical disciplines, from computational number theory to abstract algebraic geometry. And they are reportedly difficult to solve. Really, really difficult.
[3]
Testing AI systems on hard math problems shows they still perform very poorly
A team of AI researchers and mathematicians affiliated with several institutions in the U.S. and the U.K. has developed a math benchmark that allows scientists to test the ability of AI systems to solve exceptionally difficult math problems. Their paper is posted on the arXiv preprint server. Over the past few years, LLMs such as ChatGPT have grown ever more sophisticated and therefore can at times appear to have a high level of intelligence. But there is one area where they fall short -- solving difficult math problems. As developers of AI systems work to improve the math skills of their models, they have developed benchmarks to serve as a means to test their progress. Two of the most popular are MATH and GSM8K. Over time, several LLMs have improved to the extent that they are able to score up to 90% on these tests. But, as the team on this new effort noted, the difficulty level of such benchmarks is not that high. They decided a new benchmark was needed, and so they created one they named FrontierMath. To begin, the research team delved deep into the math world, reaching out to some of the brightest minds in the field. They asked them to provide some truly difficult math problems and got back hundreds of them in reply. Such problems, the researchers note, are not only unique (they have not been published before) but they also require a deep level of understanding of mathematics. Some take humans several days to solve. They also cover a wide range of topics, from number theory to algebraic geometry. Because of that breadth, brute force will not work. Neither will making educated guesses. To score well on the FrontierMath benchmark, an AI system would have to have creativity, insight and what the research team describes as "deep domain expertise." Testing thus far has demonstrated the difficulty found in FrontierMath. AIs that have scored well on traditional benchmarks have not been able to score any higher than 2%.
[4]
AI's math problem: FrontierMath benchmark shows how far technology still has to go
Artificial intelligence systems may be good at generating text, recognizing images, and even solving basic math problems -- but when it comes to advanced mathematical reasoning, they are hitting a wall. A groundbreaking new benchmark, FrontierMath, is exposing just how far today's AI is from mastering the complexities of higher mathematics. Developed by the research group Epoch AI, FrontierMath is a collection of hundreds of original, research-level math problems that require deep reasoning and creativity -- qualities that AI still sorely lacks. Despite the growing power of large language models like GPT-4o and Gemini 1.5 Pro, these systems are solving fewer than 2% of the FrontierMath problems, even with extensive support.
"We collaborated with 60+ leading mathematicians to create hundreds of original, exceptionally challenging math problems," Epoch AI announced in a post on X.com. "Current AI systems solve less than 2%." The goal is to see how well machine learning models can engage in complex reasoning, and so far, the results have been underwhelming.
A Higher Bar for AI
FrontierMath was designed to be much tougher than the traditional math benchmarks that AI models have already conquered. On benchmarks like GSM-8K and MATH, leading AI systems now score over 90%, but those tests are starting to approach saturation. One major issue is data contamination -- AI models are often trained on problems that closely resemble those in the test sets, making their performance less impressive than it might seem at first glance. "Existing math benchmarks like GSM8K and MATH are approaching saturation, with AI models scoring over 90% -- partly due to data contamination," Epoch AI posted on X.com. "FrontierMath significantly raises the bar."
In contrast, the FrontierMath problems are entirely new and unpublished, specifically crafted to prevent data leakage. These aren't the kinds of problems that can be solved with basic memorization or pattern recognition. They often require hours or even days of work from human mathematicians, and they cover a wide range of topics -- from computational number theory to abstract algebraic geometry.
Mathematical reasoning of this caliber demands more than just brute-force computation or simple algorithms. It requires what Fields Medalist Terence Tao calls "deep domain expertise" and creative insight. After reviewing the benchmark, Tao remarked, "These are extremely challenging. I think that in the near term, basically the only way to solve them is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages."
Why Is Math So Hard for AI?
Mathematics, especially at the research level, is a unique domain for testing AI. Unlike natural language or image recognition, math requires precise, logical thinking, often over many steps. Each step in a proof or solution builds on the one before it, meaning that a single error can render the entire solution incorrect.
"Mathematics offers a uniquely suitable sandbox for evaluating complex reasoning," Epoch AI posted on X.com. "It requires creativity and extended chains of precise logic -- often involving intricate proofs -- that must be meticulously planned and executed, yet allows for objective verification of results."
This makes math an ideal testbed for AI's reasoning capabilities. It's not enough for the system to generate an answer -- it has to understand the structure of the problem and navigate through multiple layers of logic to arrive at the correct solution. And unlike other domains, where evaluation can be subjective or noisy, math provides a clean, verifiable standard: either the problem is solved or it isn't.
But even with access to tools like Python, which allows AI models to write and run code to test hypotheses and verify intermediate results, the top models are still falling short. Epoch AI evaluated six leading AI systems, including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, and found that none could solve more than 2% of the problems.
The Experts Weigh In
The difficulty of the FrontierMath problems has not gone unnoticed by the mathematical community. In fact, some of the world's top mathematicians were involved in crafting and reviewing the benchmark. Fields Medalists Terence Tao, Timothy Gowers, and Richard Borcherds, along with International Mathematical Olympiad (IMO) coach Evan Chen, shared their thoughts on the challenge.
"All of the problems I looked at were not really in my area and all looked like things I had no idea how to solve," Gowers said. "They appear to be at a different level of difficulty from IMO problems."
The problems are designed not just to be hard but also to resist shortcuts. Each one is "guessproof," meaning it's nearly impossible to solve without doing the mathematical work. As the FrontierMath paper explains, the problems have large numerical answers or complex mathematical objects as solutions, with less than a 1% chance of guessing correctly without the proper reasoning. This approach prevents AI models from using simple pattern matching or brute-force approaches to stumble upon the right answer. The problems are specifically designed to test genuine mathematical understanding, and that's why they're proving so difficult for current systems.
The Long Road Ahead
Despite the challenges, FrontierMath represents a critical step forward in evaluating AI's reasoning capabilities. As the authors of the research paper note, "FrontierMath represents a significant step toward evaluating whether AI systems possess research-level mathematical reasoning capabilities." This is no small feat.
If AI can eventually solve problems like those in FrontierMath, it could signal a major leap forward in machine intelligence -- one that goes beyond mimicking human behavior and starts to approach something more akin to true understanding. But for now, AI's performance on the benchmark is a reminder of its limitations. While these systems excel in many areas, they still struggle with the kind of deep, multi-step reasoning that defines advanced mathematics.
Matthew Barnett, an AI researcher, captured the significance of FrontierMath in a series of tweets. "The first thing to understand about FrontierMath is that it's genuinely extremely hard," Barnett wrote. "Almost everyone on Earth would score approximately 0%, even if they're given a full day to solve each problem."
Barnett also speculated on what it might mean if AI eventually cracks the benchmark. "I claim that, once FrontierMath is completely solved, humans will be living alongside an entirely distinct set of intelligent beings," he wrote. "We will be sharing this Earth with artificial minds that are, in an important sense, just as smart as we are."
While that day may still be far off, FrontierMath provides a clear line in the sand -- a way to measure progress toward true AI intelligence. As AI systems continue to improve, their performance on this benchmark will be closely watched by researchers, mathematicians, and technologists alike.
What's Next for AI and Mathematics?
Epoch AI plans to expand FrontierMath over time, adding more problems and refining the benchmark to ensure it remains a relevant and challenging test for future AI systems. The researchers also plan to conduct regular evaluations, tracking how AI models perform as they evolve. In the meantime, FrontierMath offers a fascinating glimpse into the limits of artificial intelligence. It shows that while AI has made incredible strides in recent years, there are still areas -- like advanced math -- where human expertise reigns supreme.
But if and when AI does break through, it could represent a paradigm shift in our understanding of machine intelligence. For now, though, the message is clear: when it comes to solving the hardest problems in math, AI still has a lot to learn.
[5]
GPT-4 and Gemini Scored Less Than 2 Percent on This New AI Benchmark
The company said older benchmarks do not truly test AI capabilities.
Epoch AI, a California-based research institute, launched a new artificial intelligence (AI) benchmark last week. Dubbed FrontierMath, the benchmark tests large language models (LLMs) on their reasoning and mathematical problem-solving capabilities. The AI firm claims that existing math benchmarks are not very useful because of data contamination and because AI models already score very high on them. Epoch AI claims that even the leading LLMs have scored less than two percent on the new benchmark. In a post on X (formerly known as Twitter), the AI firm explained that it collaborated with more than 60 mathematicians to create hundreds of original and unpublished math problems. Epoch AI claims that these questions would take even mathematicians hours to solve. The reason cited for developing the new benchmark was the limitations of existing benchmarks such as GSM8K and MATH, on which AI models generally score highly. The company claimed that the high scores achieved by LLMs are largely due to data contamination: the questions had somehow already been fed into the AI models, so the models solved them easily. FrontierMath addresses the problem by including new problems that are unique and have not been published anywhere, mitigating the risks associated with data contamination. Further, the benchmark includes a wide range of questions including computationally intensive problems in number theory, real analysis, and algebraic geometry, as well as topics such as Zermelo-Fraenkel set theory. The AI firm says all the questions are "guess proof", meaning they cannot be solved accidentally without strong reasoning. Epoch AI highlighted that to measure AI's aptitude, benchmarks should be built around creative problem-solving where the AI has to maintain reasoning over multiple steps.
Notably, many industry veterans believe that the existing benchmarks are not sufficient to correctly measure how advanced an AI model is. Responding in a post, Noam Brown, an OpenAI researcher who was behind the company's o1 model, welcomed the new benchmark and said, "I love seeing a new eval with such low pass rates for frontier models."
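The "guess proof" requirement can be made concrete with a small sketch. It is an illustration only, not Epoch AI's actual tooling: the function names, the exact-match check, and the assumption that a blind guesser picks uniformly from a pool of equally plausible answers are all hypothetical. The 1% threshold is the rule of thumb reported for the benchmark.

```python
def verify_answer(submitted: int, expected: int) -> bool:
    """Exact-match check for an integer answer (an illustrative rule)."""
    return submitted == expected


def guess_probability(plausible_answers: int) -> float:
    """Chance that a blind guess is right, assuming a uniform pick
    from `plausible_answers` equally likely candidates (an assumption)."""
    if plausible_answers < 1:
        raise ValueError("need at least one plausible answer")
    return 1 / plausible_answers


def is_guessproof(plausible_answers: int, threshold: float = 0.01) -> bool:
    """Apply the 1% rule of thumb: the guess chance must not exceed it."""
    return guess_probability(plausible_answers) <= threshold


if __name__ == "__main__":
    # A yes/no question is trivially guessable; a large integer answer is not.
    print(is_guessproof(2))
    print(is_guessproof(10**6))
```

Under this toy model, a true/false question fails the rule (a coin flip is right half the time), while a problem whose answer is one of a million plausible integers passes it.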
[6]
Never Mind Coding -- o1 is Downright Awful at Maths!
It's not just OpenAI's o1 -- no LLM in the world is anywhere close to cracking the toughest problems in mathematics (yet). A few days ago, Epoch AI released FrontierMath, a new benchmark to evaluate the mathematical capabilities of large language models. The results revealed a startling low for these babies -- the LLMs all sucked at maths by far more than expected. The effectiveness of benchmarks has long been debated. In a research paper, Apple stated that despite their performance in benchmarks, LLMs aren't genuinely good at mathematical reasoning, and their output results from pattern recognition and replication of steps from training data. Even OpenAI mentioned that it does not want to benchmark o1 on MATH and GSM8K, since the evaluation method is quite outdated and most LLMs easily achieve high scores. "Recent frontier models do so well on MATH and GSM8K that these benchmarks are no longer effective at differentiating models," said OpenAI in a blog post. In light of such concerns, FrontierMath asks LLMs to solve mathematical problems of unprecedented difficulty. According to Epoch AI, these problems demand hours of work from human scientists and mathematicians. Moreover, the problems in the benchmark are all new and unpublished, alleviating any concerns of 'contamination' from existing benchmarks. They were developed in collaboration with more than 60 mathematicians. So, how does the benchmark work exactly, and what does it say about LLMs' capabilities today? If there's any evidence that LLMs are years behind human intelligence, FrontierMath is the best bet. The benchmark results revealed that LLMs solved less than 2% of the problems correctly. On the other hand, LLMs solved over 60% of the problems on benchmarks like Omni-MATH, MathVista, and GSM8K. "Each problem demands hours of work from expert mathematicians. Even the most advanced AI systems today, including GPT-4 and Gemini, solve less than 2% of them," revealed Epoch AI.
Several mathematicians praised the benchmark and indicated that it contained one of the most complex sets of problems. "To understand expert perspectives on FrontierMath's difficulty and relevance, we interviewed several prominent mathematicians...They unanimously characterised the problems as exceptionally challenging, requiring deep domain expertise and significant time investment to solve," mentioned Epoch AI in the research paper. "These are extremely challenging. I think that in the near term, basically, the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages," said Terence Tao, a 2006 Fields Medalist.
Moreover, Epoch AI also said that testing LLMs on hard mathematical benchmarks may be a better way to assess their overall capabilities. This differs from several other methods which require subjective judgement. "To understand and measure the progress in artificial intelligence, we need carefully designed benchmarks that can assess how well AI systems engage in complex scientific reasoning. Mathematics offers a unique opportunity for this assessment -- it requires extended chains of precise reasoning, with each step building exactly on what came before," said Epoch AI in the research paper.
The test problems had integer-based answers, and the solutions were automatically verified using Python scripts. Besides, Epoch AI claims that these problems were "guess proof", which means that all of the problems had to be fully solved to arrive at the solution. "As a rule of thumb, we require that there should not be a greater than 1% chance of guessing the correct answer without doing most of the work that one would need to do to 'correctly' find the solution," said Epoch AI. "I think they will resist AI for several years at least," said Tao, asserting that we're years away from a powerful LLM that can solve these problems.
Interestingly, Andrej Karpathy, founder of Eureka Labs, took to X and compared the benchmark to Moravec's paradox. "This is Moravec's paradox in disguise, which observed 30+ years ago that what is easy/hard for humans can be non-intuitively very different to what is easy/hard for computers," he said.
While OpenAI claims that o1 is the best LLM to date, it did not perform well on the mathematical benchmark -- just as with coding. While Claude 3.5 and Gemini 1.5 Pro beat o1 in the results, their performance wasn't notable either. As mentioned, none of these models were able to solve more than 2% of the problems. However, there is an important takeaway. To perform a fair evaluation, the researchers tested the LLMs repeatedly on four problems that had been solved at least once. They mentioned that o1-preview performed the strongest across the repeated trials. "When re-evaluating these problems that were solved at least once, o1-preview demonstrated the strongest performance across repeated trials," said Epoch AI in the research paper. That is certainly a ray of hope. Perhaps o1's strong reasoning capabilities help it produce consistent output, preventing the model from any significant deviations. Moreover, it will be interesting to see how o1 performs on FrontierMath once it is out of preview and released with all its capabilities. Or will it be overtaken by the likes of Gemini 2.0?
Epoch AI's future plans include developing more such tests and implementing other methods for better assessment. "For example, we will test the effects of increasing the token limit, allowing models to reason for longer and run more experiments per problem. We also plan to conduct multiple runs for each model-problem pair, enabling us to report statistics and confidence intervals across attempts," Epoch AI wrote.
However, assessing these models on such tough benchmarks isn't everything. "I also think it's an interesting challenge to create evals for all the 'easy' stuff that is secretly hard. Very long-context windows, coherence, autonomy, common sense, multimodal I/O that works... "How do we build good 'menial job' evals? The kinds of things you'd expect from any entry-level intern on your team," said Karpathy.
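Epoch AI's stated plan to run each model-problem pair several times and report confidence intervals can be illustrated with a short sketch. This is one standard way to put an interval around a solve rate (the Wilson score interval), not necessarily the method Epoch AI will use; the function name and the example counts are hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Hypothetical numbers: 3 correct answers in 5 repeated attempts.
    low, high = wilson_interval(3, 5)
    print(f"observed solve rate 60%, interval {low:.2f} to {high:.2f}")
```

With only a handful of attempts per problem the interval is wide, which is why a single run says little about how reliably a model can solve a given problem.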
[7]
OpenAI o1 Can't Do Maths, But Excels at Making Excuses
[8]
OpenAI is So Doomed if Inference Time Scaling for o1 Fails
But Sam Altman and his team are taking their biggest risk ever to bring AGI next year. OpenAI's progress from GPT-4 to Orion has slowed, The Information reported recently. According to the report, although OpenAI has completed only 20% of Orion's training, it is already on par with GPT-4 in intelligence, task fulfilment, and question-answering abilities. While Orion outperforms previous models, the quality improvement is less dramatic than the leap from GPT-3 to GPT-4. This led many to wonder -- have LLM improvements hit a wall? No one seemed more thrilled about it than the most celebrated AI critic, Gary Marcus, who promptly posted on X, "Folks, game over. I won. GPT is hitting a period of diminishing returns, just like I said it would." However, it appears Uncle Gary may have celebrated a bit too early. One of the article's authors quickly responded to Marcus and clarified, "With all due respect, the article introduces a new AI scaling law that could replace the old one. The sky isn't falling." Similarly, OpenAI researchers were quick to correct the narrative, asserting that the article portrays the progress of OpenAI's upcoming models inaccurately, or at least misleadingly. "There are now two key dimensions of scaling for models like the o1 series -- training time and inference time," said Adam Goldberg, a founding member of OpenAI's go-to-market (GTM) team. He explained that while traditional scaling laws focusing on pre-training larger models for longer are still relevant, there's now another important factor. "Aspect of scale remains foundational. However, the introduction of this second scaling dimension is set to unlock amazing new capabilities," he added. He was elaborating on OpenAI researcher Noam Brown's earlier statement that o1 is trained with reinforcement learning (RL) to "think" before responding via a private chain of thought. "The longer it thinks, the better it performs on reasoning tasks," Brown had said.
This, Brown explained, introduces a new dimension to scaling. "We're no longer bottlenecked by pretraining. We can now scale inference compute as well," he added.
Jason Wei, also a researcher at OpenAI, defended o1 and explained how chain of thought differs before and after o1. He said that the traditional chain-of-thought reasoning used by models like GPT was more mimicry than a true "thinking" process: the model would often reproduce reasoning paths it had encountered during pretraining, such as solutions to math problems or other tasks. The o1 system, he added, introduces a more robust and authentic "thinking" process, in which the chain of thought reflects more of an internal reasoning process, similar to how humans think. Instead of simply spitting out an answer, the model engages in an "inner monologue" or "stream of consciousness," where it actively considers and evaluates options. "You can see the model backtracking; it says things like 'alternatively, let's try' or 'wait, but'," he added. This back-and-forth process is a more dynamic and thoughtful approach to solving problems.
"People underestimate how powerful test-time compute is: compute for longer, in parallel, or fork and branch arbitrarily -- like cloning your mind 1,000 times and picking the best thoughts," said Peter Welinder, VP of product at OpenAI.
Earlier, when OpenAI released o1-mini and o1-preview, it mentioned in its blog post that o1's performance consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). Regarding inference time scaling, it said, "The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them."
It appears that OpenAI has exhausted all available data for pre-training the model and is now exploring new methods to improve o1.
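Welinder's image of "cloning your mind 1,000 times and picking the best thoughts" resembles what is commonly called best-of-N sampling. The sketch below is only a toy illustration of that idea, not OpenAI's method: `best_of_n`, `propose`, and `verifier_score` are hypothetical names, the "model" is a random integer guesser, and the "verifier" is a simple arithmetic check.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(generate: Callable[[random.Random], T],
              score: Callable[[T], float],
              n: int,
              seed: int = 0) -> T:
    """Draw n independent candidates and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)


# Toy stand-ins (assumptions for illustration only): the "model" guesses an
# integer, and the "verifier" scores a guess by how close its square is to
# the target.
TARGET = 1369  # 37 * 37


def propose(rng: random.Random) -> int:
    """Hypothetical sampler: one random guess between 1 and 100."""
    return rng.randint(1, 100)


def verifier_score(guess: int) -> float:
    """Hypothetical verifier: higher is better, 0 means an exact answer."""
    return -abs(guess * guess - TARGET)


if __name__ == "__main__":
    # Spending more samples (more test-time compute) can only keep or
    # improve the best candidate found with a fixed seed.
    for n in (1, 10, 1000):
        best = best_of_n(propose, verifier_score, n)
        print(n, best, abs(best * best - TARGET))
```

Because the seed is fixed, the candidates for a small `n` are a prefix of those for a larger `n`, so the best error never gets worse as `n` grows. The real difficulty lies in the scoring function: a trustworthy verifier is easy to write for this toy and much harder for open-ended reasoning.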
According to The Information report, Orion was partially trained on AI-generated data (or synthetic data) produced by other OpenAI models, including GPT-4 and the recently released reasoning models.
Jensen to the Rescue
When NVIDIA CEO Jensen Huang recently said "We Are Going to Take Everybody with Us," he really meant it. In a recent podcast with No Priors, Huang shared that one of the major challenges NVIDIA currently faces in computing is inference time scaling, which involves generating tokens at incredibly low latency. Huang explained that, in the future, AI systems will need to perform tasks like tree search, chain of thought, and mental simulations, reflecting on their own answers. The model would prompt itself and generate text internally, all while responding in real time, ideally within a second. This description subtly points to the capabilities of the o1 system.
While others remain uncertain, OpenAI chief Sam Altman is confident that artificial general intelligence (AGI) is closer than many think. In a recent interview with Y Combinator's Garry Tan, Altman suggested that AGI could emerge as soon as 2025. "I think we are going to get there faster than people expect," he said, underscoring OpenAI's accelerated progress.
OpenAI has yet to release o1 fully. While it may not perform well in math and coding at this stage, that doesn't mean it won't improve over time. Many believe that o1 could be the first commercial application of System 2 thinking.
Epoch AI's FrontierMath benchmark, which tests LLMs on some of the hardest unpublished problems in math, revealed that LLMs solved less than 2% of these problems. While all models performed poorly, o1-preview showed a positive sign: it was able to consistently solve problems correctly in repeated testing.
Apple recently published a paper titled 'Understanding the Limitations of Mathematical Reasoning in Large Language Models', which said that current LLMs can't reason. The researchers introduced GSM-Symbolic, a new benchmark for testing mathematical reasoning in LLMs, because they found GSM8K not accurate enough and thus not reliable for testing the reasoning abilities of LLMs. Surprisingly, on this benchmark, OpenAI's o1 demonstrated "strong performance on various reasoning and knowledge-based benchmarks", according to the researchers. However, its capabilities dropped by 30% when the researchers introduced the GSM-NoOp experiment, which involved adding irrelevant information to the questions.
Subbarao Kambhampati, a computer science and AI professor at Arizona State University, said that some of the claims of LLMs being capable of reasoning are "exaggerated". He argued that LLMs require more tools to handle System 2 tasks (reasoning), for which techniques like fine-tuning or chain of thought are not adequate.
"When we develop AI systems that can actually reason, they will involve deep learning (as one of two major components, the other being discrete search). Some people might argue that this 'proves' deep learning can reason," said François Chollet, the creator of Keras. "But that's not true. It will prove that deep learning alone isn't enough and that we need to combine it with discrete search," Chollet added. Pointing to the inclusion of Gemini in AlphaProof, he described it as "basically cosmetic and for marketing purposes". He argued that this reflects a wider trend: using the 'LLM' brand name as a blanket term for all AI progress, even though much of it is unrelated to LLMs.
When OpenAI released o1, claiming that the model thinks and reasons, Hugging Face CEO Clem Delangue was not impressed. "Once again, an AI system is not 'thinking'; it's 'processing,' 'running predictions'... just like Google or computers do," said Delangue, adding that OpenAI is "selling cheap snake oil".
However, all is not lost for OpenAI. Google DeepMind recently published a paper titled 'Chain of Thought Empowers Transformers to Solve Inherently Serial Problems'. While sharing his research on X, Denny Zhou mentioned, "We have mathematically proven that Transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed." This echoes AI researcher Andrej Karpathy's recent remarks on next-token prediction frameworks, suggesting that they could become a universal tool for solving a wide range of problems, far beyond just text or language.
Epoch AI's FrontierMath, a new mathematics benchmark, reveals that leading AI models struggle with complex mathematical problems, solving less than 2% of the challenges.
Epoch AI, a California-based research institute, has introduced FrontierMath, a groundbreaking benchmark designed to test the advanced mathematical reasoning capabilities of large language models (LLMs). This new benchmark has exposed significant limitations in current AI systems, with even leading models solving less than 2% of the problems [1].
Existing mathematical benchmarks like GSM-8k and MATH have become less effective in evaluating AI capabilities, with top models scoring over 90% on these tests [2]. Epoch AI argues that these high scores are partly due to data contamination, where AI models have been trained on similar problems, leading to artificially inflated performance [4].
FrontierMath consists of hundreds of original, expert-crafted mathematics problems.
The benchmark was developed in collaboration with over 60 mathematicians from leading institutions. The problems underwent peer review to ensure correctness and check for ambiguities, with about 1 in 20 problems requiring corrections during the review process [2].
Despite their high performance on simpler math benchmarks, top AI models like Claude 3.5 Sonnet, GPT-4o, o1-preview, and Gemini 1.5 Pro scored extremely poorly on FrontierMath, even with access to Python environments for testing and verification [2].
Fields Medalist Terence Tao commented on the difficulty of the problems, stating that solving them would likely require a combination of a semi-expert (like a graduate student in a related field), modern AI, and various algebra packages [4].
FrontierMath's results highlight the current limitations of AI in complex reasoning tasks. The benchmark serves as a crucial tool for evaluating genuine mathematical understanding and creativity in AI systems, rather than simple pattern matching or brute-force approaches [4].
While AI models have made significant strides in various domains, FrontierMath demonstrates that there is still a substantial gap between current AI capabilities and human-level mathematical reasoning. This benchmark sets a new standard for evaluating AI progress in advanced problem-solving and may guide future developments in AI research and applications [3].
© 2025 TheOutpost.AI All rights reserved