Curated by THEOUTPOST
On Sun, 13 Oct, 12:00 AM UTC
17 Sources
[1]
Apple Says AI's Math Skills Fall Short | PYMNTS.com
Recent findings from Apple researchers have cast doubt on the mathematical prowess of large language models (LLMs), challenging the notion that artificial intelligence (AI) is on the brink of human-like reasoning. In a test of 20 state-of-the-art LLMs, performance on grade-school math problems plummeted when questions were slightly modified or irrelevant information was added, Apple found. Accuracy dropped by up to 65.7%, revealing a startling fragility in AI systems when faced with tasks requiring robust logical reasoning. This weakness could have far-reaching implications for commerce relying on AI for complex decision-making. Financial institutions, in particular, may need to reassess their use of AI in tasks involving intricate calculations or risk assessment. At the heart of this debate lies the artificial general intelligence (AGI) concept -- the holy grail of AI that could match or surpass human intelligence across various tasks. While some tech leaders predict AGI's imminent arrival, these findings suggest we might be further from that goal than previously thought. "Any real-world application that requires reasoning of the sort that can be definitively verified (or not) is basically impossible for an LLM to get right with any degree of consistency," Selmer Bringsjord, professor at Rensselaer Polytechnic Institute, told PYMNTS. Bringsjord draws a clear line between AI and traditional computing: "What a calculator can do on your smartphone is something an LLM can't do -- because if someone really wanted to make sure that the result of a calculation you called for from your iPhone is correct, it would be possible, ultimately and invariably, for Apple to verify or falsify that result." Not all experts view the limitations exposed in the Apple paper as equally problematic. "The limitations outlined in this study are likely to have minimal impact on real-world applications of LLMs. This is because most real-world applications of LLMs do not require advanced mathematical reasoning," Aravind Chandramouli, head of AI at data science company Tredence, told PYMNTS. Potential solutions exist, such as fine-tuning or prompt-engineering pre-trained models for specific domains. Specialized models like WizardMath and MathGPT, designed for mathematical tasks, could enhance AI's capabilities in areas requiring rigorous logical thinking. The debate extends beyond math to a fundamental question: Do these AIs truly understand anything? This issue is central to discussions about AGI and machine cognition. "LLMs have no understanding whatsoever of what they do. They are just searching for sub-linguistic patterns from among those that are in the stored data that are statistically analogous to those in that data," Bringsjord said. Said Chandramouli: "While their coherent answers can create the illusion of understanding, the ability to map statistical correlations in data does not imply that they genuinely understand the tasks they are performing." This insight highlights the challenge of distinguishing between sophisticated pattern recognition and true comprehension in AI systems. Eric Bravick, CEO of The Lifted Initiative, acknowledges current limitations but sees potential solutions. "Large language models (LLMs) are not equipped to perform mathematical calculations. They don't understand mathematics," he said. However, he suggests that pairing LLMs with specialized AI sub-systems could lead to more accurate results. 
"When paired with specialized AI sub-systems that are trained in mathematics, they can retrieve accurate answers rather than generating them based on their statistical models trained for language production," Bravick said. Emerging technologies like retrieval-augmented generation (RAG) systems and multimodal AI could address current limitations in AI reasoning. The field of AI continues to evolve rapidly, with LLMs showing remarkable language processing and generation capabilities. However, their struggles with logical reasoning and mathematical understanding reveal significant work still needed to achieve AGI. Careful evaluation and testing of AI systems remain crucial, particularly for high-stakes applications requiring reliable reasoning. Researchers and developers may find promising paths in approaches like fine-tuning, specialized models and multimodal AI systems as they work to bridge the gap between current AI capabilities and the envisioned robust, general intelligence.
[2]
Apple Engineers Show How Flimsy AI 'Reasoning' Can Be
The new frontier in large language models is the ability to "reason" their way through problems. New research from Apple says it's not quite what it's cracked up to be. For a while now, companies like OpenAI and Google have been touting advanced "reasoning" capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical "reasoning" displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems. The fragility highlighted in these new results helps support previous research suggesting that LLMs' use of probabilistic pattern matching is missing the formal understanding of underlying concepts needed for truly reliable mathematical reasoning capabilities. "Current LLMs are not capable of genuine logical reasoning," the researchers hypothesize based on these results. "Instead, they attempt to replicate the reasoning steps observed in their training data." In "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models" -- currently available as a preprint paper -- the six Apple researchers start with GSM8K's standardized set of more than 8,000 grade-school level mathematical word problems, which is often used as a benchmark for modern LLMs' complex reasoning capabilities. They then take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values -- so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation. This approach helps avoid any potential "data contamination" that can result from the static GSM8K questions being fed directly into an AI model's training data. At the same time, these incidental changes don't alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should theoretically perform just as well when tested on GSM-Symbolic as GSM8K. Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names. This kind of variance -- both within different GSM-Symbolic runs and compared to GSM8K results -- is more than a little surprising since, as the researchers point out, "the overall reasoning steps needed to solve a question remain the same." The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any "formal" reasoning but are instead "attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data." Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI's ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. 
That's a pretty high success rate using either benchmark, regardless of whether or not the model itself is using "formal" reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems). The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding "seemingly relevant but ultimately inconsequential statements" to the questions. For this "GSM-NoOp" benchmark set (short for "no operation"), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that "five of them [the kiwis] were a bit smaller than average." Adding in these red herrings led to what the researchers termed "catastrophic performance drops" in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits in using simple "pattern matching" to "convert statements to operations without truly understanding their meaning," the researchers write.
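To illustrate the two perturbations described above, here is a rough, hypothetical sketch in Python. The template, names, and numbers are invented for illustration and are not taken from the GSM-Symbolic benchmark itself; the point is only that the name/number swaps leave the reasoning and the answer formula untouched, while the "NoOp" clause adds words but no new operations.

import random

# GSM-Symbolic-style variants swap names and numbers while the reasoning stays
# identical; GSM-NoOp-style variants append an irrelevant clause.
TEMPLATE = ("{name} picks {a} kiwis on Friday, {b} kiwis on Saturday, "
            "and twice Friday's amount on Sunday. How many kiwis does {name} have?")

def symbolic_variant(rng):
    name = rng.choice(["Sophie", "Bill", "Oliver", "Mei"])
    a, b = rng.randint(10, 60), rng.randint(10, 60)
    question = TEMPLATE.format(name=name, a=a, b=b)
    ground_truth = a + b + 2 * a   # the reasoning steps never change
    return question, ground_truth

def noop_variant(question):
    # Insert the red herring just before the final question sentence.
    distractor = " Five of Sunday's kiwis were a bit smaller than average."
    body, _, tail = question.rpartition(" How many")
    return body + distractor + " How many" + tail

rng = random.Random(0)
question, answer = symbolic_variant(rng)
print(question, "->", answer)
print(noop_variant(question))   # same correct answer; the extra clause changes nothing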
[3]
Top "Reasoning" AI Models Can be Brought to Their Knees With an Extremely Simple Trick
A team of Apple researchers has found that advanced AI models' alleged ability to "reason" isn't all it's cracked up to be. "Reasoning" is a word that's thrown around a lot in the AI industry these days, especially when it comes to marketing the advancements of frontier AI language models. OpenAI, for example, recently dropped its "Strawberry" model, which the company billed as its next-level large language model (LLM) capable of advanced reasoning. (That model has since been renamed just "o1.") But marketing aside, there's no agreed-upon industrywide definition of what reasoning exactly means. Like other AI industry terms, for example "consciousness" or "intelligence," reasoning is a slippery, ephemeral concept; as it stands, AI reasoning can be chalked up to an LLM's ability to "think" its way through queries and complex problems in a way that resembles human problem-solving patterns. But that's a notoriously difficult thing to measure. And according to the Apple scientists' yet-to-be-peer-reviewed study, frontier LLMs' alleged reasoning capabilities are way flimsier than we thought. For the study, the researchers took a closer look at the GSM8K benchmark, a widely used dataset of thousands of grade school-level mathematical word problems that is used to measure AI reasoning skills. Fascinatingly, they found that just slightly altering given problems -- switching out a number or a character's name here or adding an irrelevant detail there -- caused a massive uptick in AI errors. In short: when researchers made subtle changes to GSM8K questions that didn't impact the mechanics of the problem, frontier AI models failed to keep up. And this, the researchers argue, suggests that AI models aren't actually reasoning like humans, but are instead engaging in more advanced pattern-matching based on existing training data. "We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning," the researchers write. "Instead, they attempt to replicate the reasoning steps observed in their training data." A striking example of such an exploit is a mathematical reasoning problem involving kiwis, which reads as follows:
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
Of course, how small or large any of these kiwis are is irrelevant to the task at hand. But as the scientists' work showed, the majority of AI models routinely -- and erroneously -- incorporated the extraneous detail into their reasoning processes, ultimately resulting in errors. Take this response given by OpenAI's "o1-mini" model, a "cost-efficient" version of the AI formerly codenamed "Strawberry," which mistakenly finds that the smaller kiwis should be subtracted from the eventual total:
Sunday: Double the number he picked on Friday, which is 2 × 44 = 88 kiwis. However, on Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday's kiwis) - 5 (smaller kiwis) = 83 kiwis. Now, summing up the kiwis from all three days: 44 (Friday) + 58 (Saturday) + 83 (Sunday) = 185 kiwis. Oliver has a total of 185 kiwis.
Overall, the researchers saw the AI models' accuracy drop by anywhere from 17.5 percent to a staggering 65.7 percent, depending on the model.
And in an even simpler test, the researchers found that just switching out details like proper nouns or numbers caused a significant decrease in a model's ability to correctly answer the question, with accuracy dropping by anywhere from 0.3 percent to nearly ten percent across more than 20 top models. "LLMs remain sensitive to changes in proper names (e.g., people, foods, objects), and even more so when numbers are altered," study co-author and Apple research scientist Mehrdad Farajtabar wrote last week in a thread on X-formerly-Twitter. "Would a grade-school student's math test score vary by [about] ten percent if we only changed the names?" The study's findings not only call the intelligence of frontier AI models into question, but also the accuracy of the current methods we use to grade and market those models. After all, if you memorize a few sentences of a language phonetically, you haven't actually learned a language. You just know what a few words are supposed to sound like. "Understanding LLMs' true reasoning capabilities is crucial for deploying them in real-world scenarios where accuracy and consistency are non-negotiable -- especially in AI safety, alignment, education, healthcare, and decision-making systems," Farajtabar continued in the X thread. "Our findings emphasize the need for more robust and adaptable evaluation methods." "Developing models that move beyond pattern recognition to true logical reasoning," he added, "is the next big challenge for the AI community."
[4]
Apple's latest study proves that AI can't even solve basic grade-school math problems
Several Apple researchers have confirmed what had been previously suspected about AI -- that there are serious logical faults in its reasoning, especially when it comes to basic grade school math. According to a recently published paper from six Apple researchers, 'GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models', the mathematical "reasoning" that advanced large language models (LLMs) supposedly employ can be extremely inaccurate and fragile when the problems themselves are slightly changed. The researchers started with GSM8K's standardized set of more than 8,000 grade-school level mathematics word problems, a common benchmark for testing LLMs. Then they slightly altered the wording without changing the problem logic and dubbed the result the GSM-Symbolic test. That first set saw a performance drop of between 0.3 percent and 9.2 percent. In contrast, a second set (which added a red herring statement that had no bearing on the answer) saw "catastrophic performance drops" of between 17.5 percent and a massive 65.7 percent. It doesn't take a scientist to understand how alarming these numbers are, as they clearly show that LLMs don't properly solve problems but instead use simple "pattern matching" to "convert statements to operations without truly understanding their meaning." And if you slightly change the information found in those problems, it majorly interferes with the LLMs' ability to recognize those patterns. The main selling point of current LLMs is that they supposedly perform operations the way a human would, but studies like this one suggest otherwise -- there are critical limitations to how they function. They are supposed to employ high-level reasoning, but there is no model of logic or the world behind them, which severely limits their actual potential. And when an AI cannot perform simple math because the words are essentially too confusing and don't follow the exact same pattern, what's the point? Are computers not created to perform math at rates that humans normally cannot? At this point, you might as well close down the AI chatbot and take out your calculator instead. It's rather disappointing that the current LLMs found in recent AI chatbots all function on the same shaky foundation. They're completely reliant on the sheer amount of data they hoard and then process to give the illusion of logical reasoning, while never coming close to clearing the next true step in AI capability -- symbol manipulation, using abstract knowledge of the kind found in algebra and computer programming. Until then, what are we really doing with AI? What's the purpose of its catastrophic drain on natural resources if it's not even capable of what it has been peddled to do by every corporation that pushes its own version of it? Having so many papers, especially this one, confirming this bitter truth makes the whole endeavor truly feel like a waste of time.
[5]
LLMs can't perform "genuine logical reasoning," Apple researchers suggest
For a while now, companies like OpenAI and Google have been touting advanced "reasoning" capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical "reasoning" displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems. The fragility highlighted in these new results helps support previous research suggesting that LLMs' use of probabilistic pattern matching is missing the formal understanding of underlying concepts needed for truly reliable mathematical reasoning capabilities. "Current LLMs are not capable of genuine logical reasoning," the researchers hypothesize based on these results. "Instead, they attempt to replicate the reasoning steps observed in their training data." In "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models" -- currently available as a pre-print paper -- the six Apple researchers start with GSM8K's standardized set of over 8,000 grade-school level mathematical word problems, which is often used as a benchmark for modern LLMs' complex reasoning capabilities. They then take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values -- so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation. This approach helps avoid any potential "data contamination" that can result from the static GSM8K questions being fed directly into an AI model's training data. At the same time, these incidental changes don't alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should theoretically perform just as well when tested on GSM-Symbolic as GSM8K. Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names.
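For readers wondering how figures like the per-model "gap" are produced, the following is a simplified, hypothetical evaluation loop, not the paper's actual harness. query_model stands in for a real LLM call, and each "run" is the same set of questions re-instantiated with fresh names and numbers.

import re
import statistics

# Simplified scoring loop: accuracy is computed per run, then averaged,
# and the best-versus-worst gap across runs is reported per model.
def last_number(text):
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

def run_accuracy(questions, answers, query_model):
    correct = sum(last_number(query_model(q)) == a for q, a in zip(questions, answers))
    return correct / len(questions)

def summarize(runs_of_questions, runs_of_answers, query_model):
    accs = [run_accuracy(qs, ans, query_model)
            for qs, ans in zip(runs_of_questions, runs_of_answers)]
    return {"mean_accuracy": statistics.mean(accs),
            "best_minus_worst": max(accs) - min(accs)}   # the per-model "gap"

# Tiny demo with a lazy stub "model" that always answers 190:
runs_q = [["Q1", "Q2"], ["Q1 (new names/numbers)", "Q2 (new names/numbers)"]]
runs_a = [[190, 24], [150, 30]]
print(summarize(runs_q, runs_a, lambda q: "The answer is 190"))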
[6]
Apple Study Reveals Critical Flaws in AI's Logical Reasoning Abilities
Apple's AI research team has uncovered significant weaknesses in the reasoning abilities of large language models, according to a newly published study. The study, published on arXiv, outlines Apple's evaluation of a range of leading language models, including those from OpenAI, Meta, and other prominent developers, to determine how well these models could handle mathematical reasoning tasks. The findings reveal that even slight changes in the phrasing of questions can cause major discrepancies in model performance that can undermine their reliability in scenarios requiring logical consistency. Apple draws attention to a persistent problem in language models: their reliance on pattern matching rather than genuine logical reasoning. In several tests, the researchers demonstrated that adding irrelevant information to a question -- details that should not affect the mathematical outcome -- can lead to vastly different answers from the models. One example given in the paper involves a simple math problem asking how many kiwis a person collected over several days. When irrelevant details about the size of some kiwis were introduced, models such as OpenAI's o1 and Meta's Llama incorrectly adjusted the final total, despite the extra information having no bearing on the solution. "We found no evidence of formal reasoning in language models," the researchers wrote. "Their behavior is better explained by sophisticated pattern matching -- so fragile, in fact, that changing names can alter results by ~10%." This fragility in reasoning prompted the researchers to conclude that the models do not use real logic to solve problems but instead rely on sophisticated pattern recognition learned during training. They found that "simply changing names can alter results," a potentially troubling sign for the future of AI applications that require consistent, accurate reasoning in real-world contexts. According to the study, all models tested, from smaller open-source versions like Llama to proprietary models like OpenAI's GPT-4o, showed significant performance degradation when faced with seemingly inconsequential variations in the input data. Apple suggests that AI might need to combine neural networks with traditional, symbol-based reasoning -- an approach known as neurosymbolic AI -- to achieve more accurate decision-making and problem-solving abilities.
[7]
Apple Researchers Suggest 'Fragile' AI Reasoning Capabilities Are Overstated
AI models, the researchers concluded, rely on "sophisticated pattern matching more than true logical reasoning." According to commonly used benchmarks, frontier large language models (LLMs) have now surpassed the average human's ability to solve mathematical problems and perform complex reasoning. For instance, OpenAI's o1 model recently outperformed human experts on PhD-level science questions. However, a group of Apple researchers (Mirzadeh et al.) have recently highlighted a major flaw in the way AI performance is assessed. By changing the phrasing of the questions just a tiny bit, leading models from OpenAI, Google, Anthropic and Meta saw their ability to answer questions correctly collapse.
The Limitations of AI Benchmarks
Standardized AI benchmarks make it possible to compare different models' performance. However, if AI developers only measure intelligence using a limited set of benchmarks, they risk creating models that perform exceedingly well on a finite set of predetermined tasks but flounder in the wild. To explore the issue, Mirzadeh et al. modified the commonly used GSM8K benchmark, a set of 8,500 grade school math word problems. The researchers found that even superficial changes such as switching names negatively impacted model performance. When they changed the values, performance dropped more notably. The most significant decrease occurred when they rephrased the question entirely. For example, adding a single irrelevant clause caused performance to decline by up to 65%. Interestingly, the researchers observed this "fragility of mathematical reasoning" across all models they tested, including so-called chain-of-thought (CoT) models like OpenAI's o1 that are meant to be capable of complex reasoning.
The Rise of Chain-of-Thought
Chain-of-thought first emerged as a form of prompt engineering that breaks down complex prompts into a series of intermediate steps. Although the technique was honed as an additional stage developers could apply to LLM prompts, some models now incorporate CoT into their architecture. With CoT baked in, OpenAI's o1 is much more capable of complex reasoning than its predecessors. The model's lead developer Lukasz Kaiser has argued that the new design approach represents a shift for LLMs that will lead to more concrete logical processes. Yet, for all its apparent advancements, o1 was subject to the same fragile reasoning the Apple researchers observed in other models.
AI Still Incapable of Formal Reasoning
Despite major performance gains, the researchers concluded that even the most sophisticated LLM operations "resemble sophisticated pattern matching more than true logical reasoning". Nevertheless, their findings do suggest that CoT-based approaches are moving in the right direction. Of all the models assessed, o1 experienced the smallest performance decline between the regular GSM8K questions and the modified ones. In other words, although its reasoning was found to be fragile, it is less fragile than that of other models.
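As a rough illustration of the prompt-engineering form of chain-of-thought described above, the snippet below builds a prompt around a worked exemplar whose intermediate steps the model is nudged to imitate. The exemplar and wording are invented for illustration and are not taken from any particular system.

# Chain-of-thought as prompt engineering: the prompt carries a worked exemplar
# with explicit intermediate steps before the new question is appended.
EXEMPLAR = (
    "Q: Sara has 12 apples and buys 2 bags of 6 apples each. How many apples does she have?\n"
    "A: Let's think step by step.\n"
    "Step 1: 2 bags of 6 apples is 2 * 6 = 12 apples.\n"
    "Step 2: 12 + 12 = 24.\n"
    "The answer is 24."
)

def cot_prompt(question):
    return f"{EXEMPLAR}\n\nQ: {question}\nA: Let's think step by step."

print(cot_prompt("Oliver picks 44 kiwis on Friday, 58 on Saturday, "
                 "and twice Friday's amount on Sunday. How many kiwis does he have?"))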
[8]
Reasoning failures highlighted by Apple research on LLMs
A new paper from Apple's artificial intelligence scientists has found that engines based on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills. The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models. The group investigated the "fragility" of mathematical reasoning by adding contextual information to their queries that a human could understand, but which should not affect the fundamental mathematics of the solution. This resulted in varying answers, which shouldn't happen. "Specifically, the performance of all models declines [even] when only the numerical values in the question are altered in the GSM-Symbolic benchmark," the group wrote in their report. "Furthermore, the fragility of mathematical reasoning in these models [demonstrates] that their performance significantly deteriorates as the number of clauses in a question increases." The study found that adding even a single sentence that appears to offer relevant information to a given math question can reduce the accuracy of the final answer by up to 65 percent. "There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer," the study concluded. A particular example that illustrates the issue was a math problem that required genuine understanding of the question. The task the team developed, called "GSM-NoOp," was similar to the kind of mathematical "word problems" an elementary student might encounter. The query started with the information needed to formulate a result. "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday." The query then adds a clause that appears relevant, but actually has no bearing on the final answer, noting that of the kiwis picked on Sunday, "five of them were a bit smaller than average." The question then simply asked, "how many kiwis does Oliver have?" The note about the size of some of the kiwis picked on Sunday should have no bearing on the total number of kiwis picked. However, OpenAI's model as well as Meta's Llama3-8b subtracted the five smaller kiwis from the total result. The faulty logic echoes a previous study from 2019, which could reliably confuse AI models by asking a question about the age of two previous Super Bowl quarterbacks. By adding in background and related information about the games they played in, and a third person who was quarterback in another bowl game, the models produced incorrect answers. "We found no evidence of formal reasoning in language models," the new study concluded. The behavior of LLMs "is better explained by sophisticated pattern matching," which the study found to be "so fragile, in fact, that [simply] changing names can alter results."
[9]
Apple study reveals major AI flaw in OpenAI, Google, and Meta LLMs
Large Language Models (LLMs) may not be as smart as they seem, according to a study from Apple researchers. LLMs from OpenAI, Google, Meta, and others have been touted for their impressive reasoning skills. But research suggests their purported intelligence may be closer to "sophisticated pattern matching" than "true logical reasoning." Yep, even OpenAI's o1 advanced reasoning model. The most common benchmark for reasoning skills is a test called GSM8K, but since it's so popular, there's a risk of data contamination. That means LLMs might know the answers to the test because they were trained on those answers, not because of their inherent intelligence. To test this, the study developed a new benchmark called GSM-Symbolic, which keeps the essence of the reasoning problems but changes the variables, like names, numbers, and complexity, and adds irrelevant information. What they discovered was surprising "fragility" in LLM performance. The study tested over 20 models including OpenAI's o1 and GPT-4o, Google's Gemma 2, and Meta's Llama 3. With every single model, performance decreased when the variables were changed. Accuracy decreased by a few percentage points when names and values were changed. And as the researchers noted, OpenAI's models performed better than the open-source models. However, the variance was still deemed "non-negligible," when in principle no variance should have occurred at all. Things got really interesting when researchers added "seemingly relevant but ultimately inconsequential statements" to the mix. To test the hypothesis that LLMs relied more on pattern matching than actual reasoning, the study added superfluous phrases to math problems to see how the models would react. For example, "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?" What resulted was a significant drop in performance across the board. OpenAI's o1-preview fared the best, with a drop of 17.5 percent in accuracy. That's still pretty bad, but not as bad as Microsoft's Phi 3 model, which performed 65 percent worse. In the kiwi example, the study said LLMs tended to subtract the five smaller kiwis from the equation without understanding that kiwi size was irrelevant to the problem. This indicates that "models tend to convert statements to operations without truly understanding their meaning," which validates the researchers' hypothesis that LLMs look for patterns in reasoning problems rather than innately understanding the concepts. The study didn't mince words about its findings. Testing models on the benchmark that includes irrelevant information "exposes a critical flaw in LLMs' ability to genuinely understand mathematical concepts and discern relevant information for problem-solving." However, it bears mentioning that the authors of this study work for Apple, which is obviously a major competitor of Google, Meta, and even OpenAI -- although Apple and OpenAI have a partnership, Apple is also working on its own AI models. That said, the LLMs' apparent lack of formal reasoning skills can't be ignored. Ultimately, it's a good reminder to temper AI hype with healthy skepticism.
[10]
Researchers question AI's 'reasoning' ability as models stumble on math problems with trivial changes
How do machine learning models do what they do? And are they really "thinking" or "reasoning" the way we understand those things? This is a philosophical question as much as a practical one, but a new paper making the rounds Friday suggests that the answer is, at least for now, a pretty clear "no." A group of AI research scientists at Apple released their paper, "Understanding the limitations of mathematical reasoning in large language models," to general commentary Thursday. While the deeper concepts of symbolic learning and pattern reproduction are a bit in the weeds, the basic concept of their research is very easy to grasp. Let's say I asked you to solve a simple math problem like this one:
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. How many kiwis does Oliver have?
Obviously, the answer is 44 + 58 + (44 * 2) = 190. Though large language models are actually spotty on arithmetic, they can pretty reliably solve something like this. But what if I threw in a little random extra info, like this:
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
It's the same math problem, right? And of course even a grade-schooler would know that even a small kiwi is still a kiwi. But as it turns out, this extra data point confuses even state-of-the-art LLMs. Here's o1-mini's take:
... on Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday's kiwis) - 5 (smaller kiwis) = 83 kiwis
This is just a simple example out of hundreds of questions that the researchers lightly modified, but nearly all of which led to enormous drops in success rates for the models attempting them. Now, why should this be? Why would a model that understands the problem be thrown off so easily by a random, irrelevant detail? The researchers propose that this reliable mode of failure means the models don't really understand the problem at all. Their training data does allow them to respond with the correct answer in some situations, but as soon as the slightest actual "reasoning" is required, such as whether to count small kiwis, they start producing weird, unintuitive results. As the researchers put it in their paper: [W]e investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data. This observation is consistent with the other qualities often attributed to LLMs due to their facility with language. When, statistically, the phrase "I love you" is followed by "I love you, too," the LLM can easily repeat that -- but it doesn't mean it loves you. And although it can follow complex chains of reasoning it has been exposed to before, the fact that this chain can be broken by even superficial deviations suggests that it doesn't actually reason so much as replicate patterns it has observed in its training data. Mehrdad Farajtabar, one of the co-authors, breaks down the paper very nicely in this thread on X.
An OpenAI researcher, while commending Mirzadeh et al's work, objected to their conclusions, saying that correct results could likely be achieved in all these failure cases with a bit of prompt engineering. Farajtabar (responding with the typical yet admirable friendliness researchers tend to employ) noted that while better prompting may work for simple deviations, the model may require exponentially more contextual data in order to counter complex distractions -- ones that, again, a child could trivially point out. Does this mean that LLMs don't reason? Maybe. That they can't reason? No one knows. These are not well-defined concepts, and the questions tend to appear at the bleeding edge of AI research, where the state of the art changes on a daily basis. Perhaps LLMs "reason," but in a way we don't yet recognize or know how to control. It makes for a fascinating frontier in research, but it's also a cautionary tale when it comes to how AI is being sold. Can it really do the things they claim, and if it does, how? As AI becomes an everyday software tool, this kind of question is no longer academic.
[11]
A New Apple Study Shows AI Reasoning Has Critical Flaws
It's no surprise that AI doesn't always get things right. Occasionally, it even hallucinates. However, a recent study by Apple researchers has shown even more significant flaws within the mathematical models used by AI for formal reasoning. As part of the study, Apple scientists asked AI large language models (LLMs) the same question multiple times, in slightly varying ways, and were astounded to find the models offered unexpected variations in their answers. These variations were most prominent when numbers were involved.
Apple's Study Suggests Big Problems With AI's Reliability
The research, posted on the arXiv preprint server, concluded there was "significant performance variability across different instantiations of the same question, challenging the reliability of current GSM8K results that rely on single point accuracy metrics." GSM8K is a dataset which includes over 8,000 diverse grade-school math questions and answers. Apple researchers identified that the variance in this performance could be as much as 10%. And even slight variations in prompts can cause colossal problems with the reliability of the LLM's answers. In other words, you might want to fact-check your answers anytime you use something like ChatGPT. That's because, while it may sometimes look like AI is using logic to give you answers to your inquiries, logic isn't what's being used. AI, instead, relies on pattern recognition to provide responses to prompts. However, the Apple study shows how changing even a few unimportant words can alter that pattern recognition. One example of the critical variance presented came about through a problem regarding collecting kiwis over several days. Apple researchers conducted a control experiment, then added some inconsequential information about kiwi size.
Both Meta and OpenAI Models Showed Issues
Meta's Llama and OpenAI's o1 then altered their answers to the problem from the control, despite the kiwi size data having no tangible influence on the problem's outcome. OpenAI's GPT-4o also had issues with its performance when tiny variations were introduced in the data given to the LLM. Since LLMs are becoming more prominent in our culture, this news raises a tremendous concern about whether we can trust AI to provide accurate answers to our inquiries, especially for issues like financial advice. It also reinforces the need to accurately verify the information you receive when using large language models. That means you'll want to do some critical thinking and due diligence instead of blindly relying on AI. Then again, if you're someone who uses AI regularly, you probably already knew that.
[12]
Apple's Shocking AI Revelation: Are Language Models Just Pattern Machines?
Apple's recent research paper, "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," challenges the perceived reasoning capabilities of current large language models (LLMs). The study suggests that these models primarily rely on pattern recognition rather than genuine logical reasoning, raising concerns about their effectiveness in real-world applications. It appears that these models are more akin to skilled mimics than true thinkers, emphasizing their reliance on pattern recognition. This revelation could have significant implications for how we use and develop AI technologies in the future. Imagine a world where AI is seamlessly integrated into critical areas like education and healthcare, making decisions that impact our daily lives. Sounds promising, right? However, what if these systems falter when faced with unfamiliar situations or irrelevant details? Apple's research highlights a crucial gap in the reasoning capabilities of current LLMs, suggesting that merely scaling up data and computational power may not bridge this divide. While this prospect may sound daunting, it also opens the door to exciting possibilities for innovation. By understanding and addressing these limitations, we can pave the way for AI systems that not only excel in pattern recognition but also demonstrate true logical reasoning, ensuring they become reliable partners in our increasingly complex world. The paper provides a critical analysis of the reasoning capabilities of current LLMs, challenging the widespread belief that these models possess genuine logical reasoning abilities and revealing instead a significant reliance on pattern recognition. These findings have far-reaching implications for the practical applications of LLMs and the future development of artificial intelligence. While you might assume that advanced models like GPT-4 possess robust reasoning skills, Apple's research suggests a different reality. These models often replicate reasoning steps from their training data without truly comprehending the underlying problems. This dependence on pattern recognition, rather than authentic logical reasoning, raises substantial concerns about their effectiveness in handling complex tasks. The research highlights several crucial points. Traditional benchmarks, such as GSM8K, often report high accuracy rates for LLMs. However, these metrics may not accurately reflect genuine improvements in reasoning capabilities. Apple's introduction of the GSM-Symbolic benchmark reveals significant performance discrepancies when only names and values are altered in test questions. This finding suggests that previous benchmarks might not fully capture the models' true reasoning abilities, potentially leading to overestimation of their capabilities. A key finding of the research is the models' sensitivity to irrelevant information. When extraneous details are added to test questions, significant performance drops occur. This vulnerability to changes in names and numbers indicates potential issues with overfitting and data contamination. Such sensitivities could severely hinder the models' application in dynamic real-world environments, where data is rarely static or predictable.
The research suggests that simply scaling up data, models, or computational power may not address these fundamental reasoning limitations. For AI to progress beyond sophisticated pattern recognition, new approaches are necessary. This insight is crucial for developing models that can achieve true logical reasoning, a capability vital for their effective deployment across various fields. The ability to reason accurately and consistently is essential for AI applications in critical areas such as education, healthcare, and decision-making systems. Understanding the limitations of LLMs' reasoning capabilities is crucial for ensuring AI safety and alignment with human values. Without addressing these issues, the deployment of AI in sensitive domains could lead to unreliable or potentially harmful outcomes. Apple's study serves as a call to action for innovative strategies to enhance reasoning capabilities in AI models. Identifying and addressing these limitations is essential for advancing towards more sophisticated AI systems, including the long-term goal of Artificial General Intelligence (AGI). By focusing on these challenges, researchers and developers can contribute to the creation of AI systems that are not only more intelligent but also more reliable and aligned with human needs and ethical considerations. As AI continues to evolve, understanding and overcoming these reasoning limitations will be crucial in shaping the future of intelligent systems. This research from Apple not only highlights current shortcomings but also opens new avenues for innovation in AI development, potentially leading to more capable, reliable, and truly intelligent AI systems in the future.
[13]
Artificial intelligence does not reason, according to Apple, but is there a solution? - Softonic
Apple's artificial intelligence research team has published an interesting paper on the weaknesses in the reasoning capabilities of language models. In the paper, available on arXiv (via MacRumors), the team explains how it evaluated a series of language models from different leading developers, including OpenAI and Meta, to determine their ability to solve mathematical and logical reasoning problems. The results point to a concerning fragility in the performance of these models, which seems to owe more to pattern matching than to logical reasoning itself. One of the most notable findings of the study is that small variations in the formulation of a question can trigger large discrepancies in the models' responses. In situations where logical coherence and precision are required, this inconsistency undermines the reliability of these AIs. For example, when posing an apparently simple mathematical question, the inclusion of irrelevant details can lead to incorrect answers. In one of the tests, a math problem asked how many kiwis a person had collected over several days. When extra information was introduced, such as the size of some kiwis, the models, including OpenAI's o1 and Meta's Llama, got the total wrong, even though those details did not affect the final result at all. According to the Apple team, the models are not applying logical reasoning, but are using patterns learned during their training to "guess" the answers. The study highlights that even a change as minor as the names used in the questions can alter the results by 10%. The main concern arising from these findings is that current AI models are not capable of authentic reasoning. Instead of using logic, these systems recognize complex patterns in the data they were trained on, allowing them to generate convincing responses across a wide variety of tasks. However, this approach has a clear limitation: when the task requires consistent and precise reflection, AI often fails. In light of this situation, Apple suggests a possible solution: the combination of neural networks with traditional symbolic reasoning, an approach known as neurosymbolic AI. This hybrid approach aims to leverage the best of both worlds. Neural networks are excellent for pattern recognition and natural language processing tasks, but they lack the logical reasoning capabilities needed in many scenarios. By integrating symbolic techniques, which are more rigid but much more precise in terms of logic, AIs could improve in decision-making and problem-solving. The results of Apple's study highlight a key limitation of current AI technologies. Although it may not seem like it, even as more Apple Intelligence features begin to appear, we are still in the early stages of developing artificial intelligence and still exploring what it is capable of. In this context, research like this sets a clear path for evolving these tools: one where AIs are capable of reasoning and can provide precision and coherence when we need them.
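As a toy sketch of the neurosymbolic split described above (not Apple's proposal rendered in code), the "neural" half below is stubbed out, while the symbolic half uses the SymPy library, assuming it is installed, to do the exact algebra. The stub and the problem text are invented for illustration.

from sympy import Eq, solve, symbols

# Toy neurosymbolic pipeline: a (stubbed) "neural" parser maps language to a
# symbolic equation, and SymPy performs the exact algebraic reasoning.
def neural_parse(problem):
    # Stand-in for an LLM or other learned parser; a real system would map
    # free text to a formal representation. Here one case is hard-coded.
    x = symbols("x")
    return Eq(x - 5, 185), x   # "after giving away 5 kiwis, 185 remain"

def neurosymbolic_answer(problem):
    equation, unknown = neural_parse(problem)
    return solve(equation, unknown)[0]   # exact and verifiable, no pattern matching

print(neurosymbolic_answer("Oliver gave away 5 kiwis and has 185 left. How many did he start with?"))
# -> 190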
[14]
Apple says a high score on GSM8K dataset does not mean your AI is smarter
Recent research from Apple suggests that models that got a high score on the GSM8K dataset may not be as intelligent as they seem. Large Language Models (LLMs) have been widely praised for their seemingly impressive reasoning abilities. Models from companies like OpenAI, Google, and Meta are often showcased as powerful tools capable of solving complex problems, with tests like the GSM8K dataset being a popular benchmark to measure their reasoning skills. Yet Apple's research casts doubt on this supposedly trustworthy yardstick. The GSM8K dataset (Grade School Math 8K) is a benchmark used to evaluate the problem-solving and reasoning abilities of Large Language Models (LLMs). It consists of over 8,000 grade-school level math word problems, which typically require arithmetic, logical reasoning, and multi-step problem-solving skills to arrive at the correct answer. The GSM8K dataset has become a popular tool to assess whether LLMs can reason logically and solve real-world problems. However, there is concern that many AI models perform well on this dataset through pattern matching rather than true reasoning, as they might have been exposed to similar problems during training. Apple researchers argue that this success may be more about sophisticated pattern matching than genuine logical reasoning. Since the GSM8K dataset is so commonly used, there's a risk of data contamination -- meaning that many LLMs may have already seen these problems during training, inflating their apparent intelligence. To address this, Apple developed a new benchmark called GSM-Symbolic. This test retains the core reasoning elements of the GSM8K dataset but introduces changes like different names, numbers, and complexity, along with irrelevant information. The results? Every LLM tested, including models like OpenAI's GPT-4o and Meta's Llama 3, saw a significant drop in performance when faced with this new challenge. This suggests that LLMs struggle with true reasoning when variables are altered, further questioning their actual problem-solving skills. The study by Apple sheds light on a critical flaw in LLMs: they are excellent at detecting patterns in the training data but lack true logical reasoning. For example, when math problems included irrelevant details, such as the size of kiwis in a fruit-picking scenario, many LLMs subtracted that irrelevant detail from the equation, demonstrating a failure to discern which information was necessary to solve the problem. In tests with the GSM8K dataset, LLMs like OpenAI's models performed better than their open-source counterparts, but the drop in accuracy when irrelevant information was added suggests that these systems are far from achieving genuine intelligence. This has profound implications for the future development of AI, showing that while LLMs may mimic intelligence, they still struggle to truly understand context. Apple's research underscores the limitations of relying on benchmarks like the GSM8K dataset to assess AI intelligence. While these tests can measure pattern recognition, they don't always capture the nuances of true logical reasoning. The introduction of the GSM-Symbolic benchmark provides a more rigorous test of an AI's ability to handle unfamiliar variables and irrelevant information -- skills essential for real-world problem-solving.
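For readers who want to inspect GSM8K directly, the following sketch assumes the Hugging Face datasets library and the hub ID "gsm8k" with its "main" configuration; GSM8K solutions conventionally end with a "#### <number>" marker, which is the value graders compare against.

from datasets import load_dataset

# Peek at GSM8K itself (assumes the Hugging Face "datasets" package and the
# hub ID "gsm8k" with the "main" configuration). Each record has a "question"
# and an "answer" whose last line ends with "#### <final number>".
def final_answer(solution):
    return solution.split("####")[-1].strip()

gsm8k_test = load_dataset("gsm8k", "main", split="test")
example = gsm8k_test[0]
print(example["question"])
print("gold answer:", final_answer(example["answer"]))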
Sam Altman, CEO of OpenAI, has even acknowledged these challenges, referring to current LLMs as "incredibly dumb" despite their impressive outward appearance, in an exclusive interview with MIT Technology Review. The real test for future LLMs will be their ability to go beyond pattern recognition and develop more robust problem-solving abilities. The findings from Apple's study offer a sobering perspective on the current state of LLMs. While models trained on datasets like GSM8K may perform well in controlled environments, their reasoning abilities falter when tested on more complex, real-world problems. This highlights the importance of further research and development to ensure that AI models move beyond surface-level intelligence and develop true logical reasoning skills. For now, it's crucial to temper the excitement surrounding AI with healthy skepticism, focusing on safer, smarter AI systems that can handle more than just pattern recognition.
[15]
Apple agrees with Sam Altman: AI is incredibly dumb
Is ChatGPT getting smarter, or is it getting better at seeming smart? According to Apple, it's the latter. A team of AI researchers at Apple published a paper this weekend claiming that most leading large language AI models aren't actually capable of advanced reasoning, despite how intelligent they might seem. Large language models, or LLMs, like ChatGPT appear to be getting more advanced and "intelligent" every year. Under the hood, though, their logical reasoning hasn't improved much. According to Apple's research, current LLMs' capabilities "may resemble sophisticated pattern matching more than true logical reasoning." What does this research mean for the reality of today's top AI models? It might be time to focus on creating safer AI models before trying to build smarter ones. A team of AI researchers at Apple has revealed the findings of a new benchmark test, GSM-Symbolic, which posed a whole new challenge for large language models. The test revealed that today's top AI models have limited reasoning capabilities, despite how intelligent they might seem. In fact, the GSM-Symbolic test revealed that the AI models in the study struggled with basic grade school math problems. The more complex the questions became, the worse the AIs performed. The researchers explain in their paper, "Adding seemingly relevant but ultimately inconsequential information to the logical reasoning of the problem led to substantial performance drops of up to 65% across all state-of-the-art models. "Importantly, we demonstrate that LLMs struggle even when provided with multiple examples of the same question or examples containing similar irrelevant information." This means today's leading AI models are easily confused by logic-based questions such as math problems. They rely on copying the patterns in math problems in their training data but struggle to do math the way a human can. This shows that large language models only appear to be smart, when, in reality, they're just really good at acting smart. This echoes OpenAI CEO Sam Altman's remarks, claiming AI is actually "incredibly dumb" in its current state. OpenAI is the company behind ChatGPT and Altman has been ambitious in his pursuit of artificial general intelligence, which would be capable of true logical reasoning. Apple's study seems to agree. It concludes, "We believe further research is essential to develop AI models capable of formal reasoning, moving beyond pattern recognition to achieve more robust and generalizable problem-solving skills." If the research published by Apple's AI team is accurate, today's leading large language models struggle to hold up on an episode of Are You Smarter Than a Fifth Grader. However, that doesn't mean AI can't still be a powerful tool, one that can be incredibly helpful... or harmful. In fact, the Apple study reveals a core strength and potential danger of AI: its ability to mimic. LLMs like ChatGPT may seem capable of reasoning the way humans are, but as this study points out, that's just the AI copying human language and patterns. That might not be as advanced as actual logical reasoning, but AI has gotten extremely good at mimicking others. Unfortunately, bad actors have been quick to take advantage of every advancement. For example, this weekend tech YouTuber Marques Brownlee announced on X that a company used AI to replicate his voice in an ad for their product, which Brownlee was not affiliated with. The AI-generated decoy is shockingly similar to Brownlee's real voice, though. 
The ad was clearly intended to deceive viewers into thinking Brownlee was endorsing their product. Unfortunately, incidents like this are becoming more common, from fake presidential endorsements from Taylor Swift to Scarlett Johansson's claims that OpenAI copied her voice without her permission. Average users might not think these controversies affect them, but they're arguably the most critical aspect of the AI industry. It's great that basic tools like ChatGPT or Gemini are useful to many people. However, the ways AI is also being misused for deep fakes, deception, and scams pose a serious risk to the safety of this technology and everyone who interacts with it, knowingly or otherwise.
[16]
Apple researchers suggest artificial intelligence is still mostly an illusion
Researchers at Apple have found evidence, via testing, showing that the seemingly intelligent responses given by AI-based LLMs are little more than an illusion. In their paper posted on the arXiv preprint server, the researchers argue that after testing several LLMs, they found that the models are not capable of performing genuine logical reasoning. Over the past few years, many LLMs such as ChatGPT have developed to the point that many users have begun to wonder if they possess true intelligence. In this new effort, the team at Apple has addressed the question by assuming the answer lies in the ability of an intelligent being, or machine, to understand the nuances present in simple situations that require logical reasoning. One such nuance is the ability to separate pertinent information from information that is not pertinent. If a child asks a parent how many apples are in a bag, for example, while also noting that several are too small to eat, both the child and parent understand that the size of the apples has nothing to do with the number of them present. This is because they both possess logical reasoning abilities. In this new study, the researchers tested several LLMs on their ability to truly understand what they are being asked, by implicitly requiring them to ignore information that is not pertinent. Their testing involved asking multiple LLMs hundreds of questions that have been used before as a means of testing the abilities of LLMs -- but the researchers also included a bit of non-pertinent information. And that, they found, was enough to confuse the LLMs into giving wrong or even nonsensical answers to questions they had previously answered correctly. This, the researchers suggest, shows that the LLMs do not really understand what they are being asked. They instead recognize the structure of a sentence and then spit out an answer based on what they have learned through machine-learning algorithms. They also note that most of the LLMs they tested very often respond with answers that can seem correct, but upon further review are not -- such as when the models are asked how they "feel" about something and respond in ways that suggest the AI believes it is capable of such feelings.
[17]
Apple Proves OpenAI o1 is Actually Good at Reasoning
While some say LLMs are our ticket to AGI, others think they're just glorified text-producing algorithms with a fancy name. Apple, for its part, has gotten better at gaslighting the AI companies that are spending everything they have on making LLMs better at reasoning. A research team of six people at Apple recently published a paper titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," which basically says that current LLMs can't reason. "...current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data," reads the paper. The study covers LLMs like OpenAI's GPT-4o and even the much-touted "thinking and reasoning" model o1, along with a series of other models such as Llama, Phi, Gemma, and Mistral.

Mehrdad Farajtabar, the senior author of the paper, posted on X explaining how the team reached its conclusion. According to him, LLMs just follow sophisticated patterns, and even models smaller than 3 billion parameters are now hitting scores on GSM8K, the benchmark released by OpenAI three years ago, that only much larger models could reach before. The researchers introduced GSM-Symbolic, a new benchmark for testing mathematical reasoning in LLMs, because GSM8K was no longer accurate enough, and thus not reliable, for testing the reasoning abilities of LLMs. On this benchmark, OpenAI's o1 demonstrated "strong performance on various reasoning and knowledge-based benchmarks," according to the researchers, but its capabilities dropped by 30% when they introduced the GSM-NoOp experiment, which involves adding irrelevant information to the questions.

This proves that the "reasoning" capabilities of OpenAI's models are definitely getting better, and maybe GPT-5 will be a lot better. Or maybe it's just Apple's LLMs that don't reason well, but the team didn't test Apple's own model. Not everyone is happy with the research paper, either, as it never explains what "reasoning" actually means and simply introduces a new benchmark for evaluating LLMs.

"Overall, we found no evidence of formal reasoning in language models...their behaviour is better explained by sophisticated pattern matching -- so fragile, in fact, that changing names can alter results by ~10%!" Farajtabar further added that scaling these models would just result in 'better pattern machines' but not 'better reasoners'.

Some people have been making this claim all along: that LLMs cannot reason and are a detour on the road to AGI. Possibly, Apple has finally accepted this after trying out LLMs in its own products, and this may also be one of the reasons it backed out of its investment in OpenAI. Many researchers have praised Apple's paper and believe it is important that others also accept that LLMs cannot reason. Gary Marcus, a long-standing critic of LLMs, also shared several examples of LLMs failing at reasoning tasks such as calculation and chess.

On the other hand, Paras Chopra, an AI researcher, argues that Apple's paper confuses reasoning with computation. "Reasoning is knowing an algorithm to solve a problem, not solving all of it in your head," he said, explaining that most LLMs know the approach to solving a problem even if they end up with the wrong answer. According to him, knowing the approach is enough to tell whether an LLM is reasoning, even if the final answer is wrong.
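As a rough illustration of the GSM-Symbolic idea described above, the sketch below shows how a single grade-school problem can be turned into a template whose names and values are re-sampled to produce many surface variants; the paper reports that even such superficial changes can shift results by around 10%. The template, names, and numbers here are invented for illustration; this is not the researchers' code.

```python
# Illustrative sketch of GSM-Symbolic-style templating: the structure of the
# problem (and its answer formula) is fixed, while names and numbers vary.
import random

TEMPLATE = (
    "{name} buys {a} pencils on Monday and {b} pencils on Tuesday. "
    "How many pencils does {name} have now?"
)

NAMES = ["Sam", "Maria", "Ken", "Priya"]

def sample_variant(rng: random.Random) -> tuple[str, int]:
    """Return one surface variant of the problem and its ground-truth answer."""
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b)
    return question, a + b  # the underlying reasoning step never changes

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        question, answer = sample_variant(rng)
        print(question)
        print("expected answer:", answer)
```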
Discussions on Hacker News highlight that some of the questions the Apple researchers asked were essentially trying to do a "gotcha!" on the models, since they included irrelevant information that LLMs cannot actively filter out. One commenter argued that reasoning is the progressive, iterative reduction of informational entropy in a knowledge domain, and that OpenAI's o1-preview does this better by introducing iteration; it is not perfect, but it does it.

Subbarao Kambhampati, a computer science and AI professor at ASU, agreed that some claims about LLMs being capable of reasoning are exaggerated. He added that LLMs require more tools to handle System 2 tasks (reasoning), and that techniques like fine-tuning or Chain of Thought are not adequate on their own.

When OpenAI released o1, claiming that the model thinks and reasons, Clem Delangue, the CEO of Hugging Face, was not impressed. "Once again, an AI system is not 'thinking', it's 'processing', 'running predictions',... just like Google or computers do," said Delangue, arguing that OpenAI is painting a false picture of what its newest model can achieve. While some agreed, others argued that this is exactly how human brains work as well. "Once again, human minds aren't 'thinking' they are just executing a complex series of bio-chemical / bio-electrical computing operations at massive scale," replied Phillip Rhodes to Delangue.

To test reasoning, some people also ask LLMs how many Rs there are in the word "strawberry," which is not a meaningful test: LLMs can't count letters directly because they process text in chunks called "tokens." Tests of reasoning have been problematic for LLMs ever since the models were created.

Everyone seems to have strong opinions on LLMs. Some, grounded in research by experts such as Yann LeCun or Francois Chollet, argue that LLM research should be taken more seriously, while others simply follow the hype or criticise it. Some say LLMs are our ticket to AGI; others think they're just glorified text-producing algorithms with a fancy name. Meanwhile, Andrej Karpathy recently said that next-token prediction, the technique these LLMs, or Transformers, are built on, might be able to solve many problems outside the domains where it is currently used. While it seems true to some extent that LLMs can reason, when they are actually put to the test, they end up failing it.
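As a small illustration of the tokenization point above, the snippet below uses the open-source tiktoken library, which implements the byte-pair encodings used by OpenAI models, to show that a word like "strawberry" is handled as multi-character chunks rather than individual letters, which is why letter-counting is an awkward probe of reasoning. The exact split depends on the encoding chosen; this is just a sketch.

```python
# Illustration of why letter-counting is awkward for LLMs: models see
# token chunks, not individual characters. Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by many OpenAI models

word = "strawberry"
token_ids = enc.encode(word)
chunks = [enc.decode([t]) for t in token_ids]

print(f"'{word}' is seen as {len(token_ids)} token(s): {chunks}")
print("Actual count of 'r':", word.count("r"))  # trivial at the character level
```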
A recent study by Apple researchers exposes significant flaws in the mathematical reasoning capabilities of large language models (LLMs), challenging the notion of AI's advanced reasoning skills and raising questions about their real-world applications.
A team of six Apple researchers has cast doubt on the mathematical prowess of large language models (LLMs), challenging the notion that artificial intelligence (AI) is approaching human-like reasoning capabilities. The study, titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," reveals significant weaknesses in AI systems when faced with tasks requiring robust logical reasoning 1.
The researchers utilized the GSM8K benchmark, a set of over 8,000 grade-school level mathematical word problems, to evaluate the performance of more than 20 state-of-the-art LLMs. They introduced two key modifications to the original benchmark: GSM-Symbolic, which turns each problem into a template so that names and numerical values can be varied without changing the underlying logic, and GSM-NoOp, which adds seemingly relevant but ultimately inconsequential statements to the problems.
The results were striking: simply changing the names and numbers in a problem caused noticeable swings in accuracy, on the order of 10% for some models, while adding irrelevant information led to performance drops of up to 65% across state-of-the-art models.
These findings suggest that current LLMs may not be capable of genuine logical reasoning. Instead, they appear to rely on pattern matching and replication of reasoning steps observed in their training data 4.
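A minimal sketch of how such a perturbation-based evaluation could be run is shown below; `ask_model` is a hypothetical stand-in for whatever LLM API is being tested, and the questions and answers are placeholders. It only illustrates the compare-accuracy-across-variants methodology, not the paper's actual harness.

```python
# Sketch of a perturbation-based evaluation: compare accuracy on original
# questions against accuracy on variants with irrelevant details added.
from typing import Callable

# Each item: (original question, perturbed question, correct answer)
DATASET = [
    (
        "Ava buys 3 pens and 5 pencils. How many items does she buy?",
        "Ava buys 3 pens and 5 pencils. Two of the pencils are blue. "
        "How many items does she buy?",
        "8",
    ),
    # ... more problems would go here ...
]

def accuracy(ask_model: Callable[[str], str], use_perturbed: bool) -> float:
    """Fraction of problems answered correctly under one condition."""
    correct = 0
    for original, perturbed, answer in DATASET:
        question = perturbed if use_perturbed else original
        if ask_model(question).strip() == answer:
            correct += 1
    return correct / len(DATASET)

def report(ask_model: Callable[[str], str]) -> None:
    base = accuracy(ask_model, use_perturbed=False)
    noop = accuracy(ask_model, use_perturbed=True)
    print(f"baseline accuracy:  {base:.0%}")
    print(f"perturbed accuracy: {noop:.0%}")
    print(f"drop: {base - noop:.0%}")

if __name__ == "__main__":
    # Dummy "model" that always answers "8", just to exercise the harness.
    report(lambda question: "8")
```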
Dr. Selmer Bringsjord, professor at Rensselaer Polytechnic Institute, commented, "Any real-world application that requires reasoning of the sort that can be definitively verified (or not) is basically impossible for an LLM to get right with any degree of consistency" 1.
The implications of these limitations for AI applications in commerce and decision-making are significant. Financial institutions and other sectors relying on AI for complex calculations may need to reassess their use of these technologies 1.
However, not all experts view these limitations as equally problematic. Aravind Chandramouli, head of AI at Tredence, suggests that the impact on real-world applications may be minimal, as most do not require advanced mathematical reasoning 1.
Researchers and industry professionals are exploring several approaches to address these limitations.
Eric Bravick, CEO of The Lifted Initiative, suggests that emerging technologies like retrieval-augmented generation (RAG) systems and multimodal AI could help address current limitations in AI reasoning 1.
This study emphasizes the need for more robust and adaptable evaluation methods for AI models. Senior study author Mehrdad Farajtabar stressed the importance of understanding LLMs' true reasoning capabilities for deploying them in real-world scenarios where accuracy and consistency are crucial 3.
As the field of AI continues to evolve, these findings highlight the significant work still needed to achieve artificial general intelligence (AGI) and underscore the importance of careful evaluation and testing of AI systems, particularly for high-stakes applications requiring reliable reasoning 5.