The Outpost is a comprehensive collection of curated artificial intelligence software tools that cater to the needs of small business owners, bloggers, artists, musicians, entrepreneurs, marketers, writers, and researchers.
© 2024 TheOutpost.AI All rights reserved
Curated by THEOUTPOST
On October 13, 2024
10 Sources
[1]
Apple says a high score on GSM8K dataset does not mean your AI is smarter
Recent research from Apple suggests that models scoring highly on the GSM8K dataset may not be as intelligent as they seem. Large Language Models (LLMs) have been widely praised for their seemingly impressive reasoning abilities. Models from companies like OpenAI, Google, and Meta are often showcased as powerful tools capable of solving complex problems, with tests like the GSM8K dataset serving as a popular benchmark for measuring their reasoning skills. Yet Apple's research calls that trust into question. The GSM8K dataset (Grade School Math 8K) is a benchmark used to evaluate the problem-solving and reasoning abilities of LLMs. It consists of over 8,000 grade-school level math word problems, which typically require arithmetic, logical reasoning, and multi-step problem-solving skills to arrive at the correct answer. The GSM8K dataset has become a popular tool to assess whether LLMs can reason logically and solve real-world problems. However, there is concern that many AI models perform well on this dataset through pattern matching rather than true reasoning, as they may have been exposed to similar problems during training. Apple researchers argue that this success may be more about sophisticated pattern matching than genuine logical reasoning. Since the GSM8K dataset is so commonly used, there's a risk of data contamination -- meaning that many LLMs may have already seen these problems during training, inflating their apparent intelligence. To address this, Apple developed a new benchmark called GSM-Symbolic. This test retains the core reasoning elements of the GSM8K dataset but introduces changes such as different names, numbers, and complexity, along with irrelevant information. The results? Every LLM tested, including models like OpenAI's GPT-4 and Meta's Llama 3, saw a significant drop in performance when faced with this new challenge. 
This suggests that LLMs struggle with true reasoning when variables are altered, further questioning their actual problem-solving skills. The study by Apple sheds light on a critical flaw in LLMs: They are excellent at detecting patterns in the training data but lack true logical reasoning. For example, when math problems included irrelevant details, such as the size of kiwis in a fruit-picking scenario, many LLMs subtracted that irrelevant detail from the equation, demonstrating a failure to discern which information was necessary to solve the problem. In tests with the GSM8K dataset, LLMs like OpenAI's models performed better than their open-source counterparts, but the drop in accuracy when irrelevant information was added suggests that these systems are far from achieving genuine intelligence. This has profound implications for the future development of AI, showing that while LLMs may mimic intelligence, they still struggle to truly understand context. Apple's research underscores the limitations of relying on benchmarks like the GSM8K dataset to assess AI intelligence. While these tests can measure pattern recognition, they don't always capture the nuances of true logical reasoning. The introduction of the GSM-Symbolic benchmark provides a more rigorous test of an AI's ability to handle unfamiliar variables and irrelevant information -- skills essential for real-world problem-solving. Sam Altman, CEO of OpenAI, has even acknowledged these challenges, referring to current LLMs as "incredibly dumb" despite their impressive outward appearance in an exclusive interview with MIT Technology Review. The real test for future LLMs will be their ability to go beyond pattern recognition and develop more robust problem-solving abilities. The findings from Apple's study offer a sobering perspective on the current state of LLMs. 
While models trained on datasets like GSM8K may perform well in controlled environments, their reasoning abilities falter when tested on more complex, real-world problems. This highlights the importance of further research and development to ensure that AI models move beyond surface-level intelligence and develop true logical reasoning skills. For now, it's crucial to temper the excitement surrounding AI with healthy skepticism, focusing on safer, smarter AI systems that can handle more than just pattern recognition.
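The GSM-Symbolic idea described above -- keep a problem's reasoning structure fixed while varying surface details that should not matter -- can be sketched in a few lines. The template, names, and number ranges below are invented for illustration; the benchmark's real templates are far richer.

```python
import random

# A minimal sketch of GSM-Symbolic-style perturbation: the same arithmetic
# structure is preserved while names and numbers change per variant, so a
# model cannot rely on having memorized a static benchmark answer.
TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "How many apples does {name} have?")
NAMES = ["Sophie", "Bill", "Oliver", "Mara"]

def make_variant(seed: int) -> tuple[str, int]:
    rng = random.Random(seed)  # deterministic per seed, so runs are reproducible
    name = rng.choice(NAMES)
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    # Ground truth is recomputed for each variant from the sampled values.
    return TEMPLATE.format(name=name, a=a, b=b), a + b

question, answer = make_variant(0)
```

If a model truly reasons, its accuracy should be identical across all seeds; the Apple study found it was not.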
[2]
Apple agrees with Sam Altman: AI is incredibly dumb
Is ChatGPT getting smarter, or is it getting better at seeming smart? According to Apple, it's the latter. A team of AI researchers at Apple published a paper this weekend claiming that most leading large language AI models aren't actually capable of advanced reasoning, despite how intelligent they might seem. Large language models, or LLMs, like ChatGPT appear to be getting more advanced and "intelligent" every year. Under the hood, though, their logical reasoning hasn't improved much. According to Apple's research, current LLMs' capabilities "may resemble sophisticated pattern matching more than true logical reasoning." What does this research mean for the reality of today's top AI models? It might be time to focus on creating safer AI models before trying to build smarter ones. A team of AI researchers at Apple has revealed the findings of a new benchmark test, GSM-Symbolic, which posed a whole new challenge for large language models. The test revealed that today's top AI models have limited reasoning capabilities, despite how intelligent they might seem. In fact, the GSM-Symbolic test revealed that the AI models in the study struggled with basic grade school math problems. The more complex the questions became, the worse the AIs performed. The researchers explain in their paper, "Adding seemingly relevant but ultimately inconsequential information to the logical reasoning of the problem led to substantial performance drops of up to 65% across all state-of-the-art models. Importantly, we demonstrate that LLMs struggle even when provided with multiple examples of the same question or examples containing similar irrelevant information." This means today's leading AI models are easily confused by logic-based questions such as math problems. They rely on copying the patterns in math problems in their training data but struggle to do math the way a human can. 
This shows that large language models only appear to be smart, when, in reality, they're just really good at acting smart. This echoes OpenAI CEO Sam Altman's remarks claiming AI is actually "incredibly dumb" in its current state. OpenAI is the company behind ChatGPT, and Altman has been ambitious in his pursuit of artificial general intelligence, which would be capable of true logical reasoning. Apple's study seems to agree. It concludes, "We believe further research is essential to develop AI models capable of formal reasoning, moving beyond pattern recognition to achieve more robust and generalizable problem-solving skills." If the research published by Apple's AI team is accurate, today's leading large language models would struggle to hold up on an episode of Are You Smarter Than a Fifth Grader. However, that doesn't mean AI can't still be a powerful tool, one that can be incredibly helpful... or harmful. In fact, the Apple study reveals a core strength and potential danger of AI: its ability to mimic. LLMs like ChatGPT may seem capable of reasoning the way humans do, but as this study points out, that's just the AI copying human language and patterns. That might not be as advanced as actual logical reasoning, but AI has gotten extremely good at mimicking others. Unfortunately, bad actors have been quick to take advantage of every advancement. For example, this weekend tech YouTuber Marques Brownlee announced on X that a company used AI to replicate his voice in an ad for a product Brownlee was not affiliated with. The AI-generated decoy is shockingly similar to Brownlee's real voice, and the ad was clearly intended to deceive viewers into thinking Brownlee was endorsing the product. Unfortunately, incidents like this are becoming more common, from fake presidential endorsements from Taylor Swift to Scarlett Johansson's claims that OpenAI copied her voice without her permission. 
Average users might not think these controversies affect them, but they're arguably the most critical aspect of the AI industry. It's great that basic tools like ChatGPT or Gemini are useful to many people. However, the ways AI is also being misused for deep fakes, deception, and scams pose a serious risk to the safety of this technology and everyone who interacts with it, knowingly or otherwise.
[3]
Apple study reveals major AI flaw in OpenAI, Google, and Meta LLMs
Large Language Models (LLMs) may not be as smart as they seem, according to a study from Apple researchers. LLMs from OpenAI, Google, Meta, and others have been touted for their impressive reasoning skills. But research suggests their purported intelligence may be closer to "sophisticated pattern matching" than "true logical reasoning." Yep, even OpenAI's o1 advanced reasoning model. The most common benchmark for reasoning skills is a test called GSM8K, but since it's so popular, there's a risk of data contamination. That means LLMs might know the answers to the test because they were trained on those answers, not because of their inherent intelligence. To test this, the study developed a new benchmark called GSM-Symbolic, which keeps the essence of the reasoning problems but changes the variables -- names, numbers, complexity -- and adds irrelevant information. What they discovered was surprising "fragility" in LLM performance. The study tested over 20 models, including OpenAI's o1 and GPT-4o, Google's Gemma 2, and Meta's Llama 3. Every single model's performance decreased when the variables were changed. Accuracy dropped by a few percentage points when names and values were changed. And as the researchers noted, OpenAI's models performed better than the open-source models. Still, the variance was deemed "non-negligible" -- ideally, no variance should have occurred at all. However, things got really interesting when researchers added "seemingly relevant but ultimately inconsequential statements" to the mix. To test the hypothesis that LLMs relied more on pattern matching than actual reasoning, the study added superfluous phrases to math problems to see how the models would react. For example: "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?" 
What resulted was a significant drop in performance across the board. OpenAI's o1 Preview fared the best, with a 17.5 percent drop in accuracy. That's still pretty bad, but not as bad as Microsoft's Phi 3 model, which performed 65 percent worse. In the kiwi example, the study said LLMs tended to subtract the five smaller kiwis from the equation without understanding that kiwi size was irrelevant to the problem. This indicates that "models tend to convert statements to operations without truly understanding their meaning," which validates the researchers' hypothesis that LLMs look for patterns in reasoning problems rather than innately understanding the concept. The study didn't mince words about its findings. Testing models on the benchmark that includes irrelevant information "exposes a critical flaw in LLMs' ability to genuinely understand mathematical concepts and discern relevant information for problem-solving." However, it bears mentioning that the authors of this study work for Apple, which is obviously a major competitor of Google, Meta, and even OpenAI -- although Apple and OpenAI have a partnership, Apple is also working on its own AI models. That said, the LLMs' apparent lack of formal reasoning skills can't be ignored. Ultimately, it's a good reminder to temper AI hype with healthy skepticism.
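Worked out explicitly, the kiwi problem shows how small the trap is. The arithmetic below follows the problem as quoted in the study; the "flawed" line reproduces the failure mode the researchers describe, where the irrelevant size detail is converted into a subtraction.

```python
# The "five smaller kiwis" clause is a no-op: kiwi size has no bearing on
# the count, so a correct solver simply ignores it.
friday = 44
saturday = 58
sunday = 2 * friday                         # "double the number he did on Friday"
correct_total = friday + saturday + sunday  # 44 + 58 + 88 = 190

# The failure mode the study describes: models convert the irrelevant
# statement into an operation, treating "five were smaller" as "five fewer".
flawed_total = correct_total - 5            # 185, the wrong answer models gave
```

A human solver discards the size clause without effort; the models' tendency to turn every statement into an operation is what the paper calls converting statements to operations without understanding their meaning.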
[4]
Apple's Shocking AI Revelation: Are Language Models Just Pattern Machines?
Apple's recent research paper, "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," challenges the perceived reasoning capabilities of current large language models (LLMs). The study suggests that these models primarily rely on pattern recognition rather than genuine logical reasoning, raising concerns about their effectiveness in real-world applications. It appears that these models are more akin to skilled mimics than true thinkers, emphasizing their reliance on pattern recognition. This revelation could have significant implications for how we use and develop AI technologies in the future. Imagine a world where AI is seamlessly integrated into critical areas like education and healthcare, making decisions that impact our daily lives. Sounds promising, right? However, what if these systems falter when faced with unfamiliar situations or irrelevant details? Apple's research highlights a crucial gap in the reasoning capabilities of current LLMs, suggesting that merely scaling up data and computational power may not bridge this divide. While this prospect may sound daunting, it also opens the door to exciting possibilities for innovation. By understanding and addressing these limitations, we can pave the way for AI systems that not only excel in pattern recognition but also demonstrate true logical reasoning, ensuring they become reliable partners in our increasingly complex world. Apple's paper provides a critical analysis of the reasoning capabilities of current LLMs, challenging the widespread belief that these models possess genuine logical reasoning abilities and revealing instead a significant reliance on pattern recognition. These findings have far-reaching implications for the practical applications of LLMs and the future development of artificial intelligence. 
While you might assume that advanced models like GPT-4 possess robust reasoning skills, Apple's research suggests a different reality. These models often replicate reasoning steps from their training data without truly comprehending the underlying problems. This dependence on pattern recognition, rather than authentic logical reasoning, raises substantial concerns about their effectiveness in handling complex tasks. The research highlights several crucial points. Traditional benchmarks, such as GSM8K, often report high accuracy rates for LLMs. However, these metrics may not accurately reflect genuine improvements in reasoning capabilities. Apple's introduction of the GSM-Symbolic benchmark reveals significant performance discrepancies when only names and values are altered in test questions. This finding suggests that previous benchmarks might not fully capture the models' true reasoning abilities, potentially leading to overestimation of their capabilities. A key finding of the research is the models' sensitivity to irrelevant information. When extraneous details are added to test questions, significant performance drops occur. This vulnerability to changes in names and numbers indicates potential issues with overfitting and data contamination. Such sensitivities could severely hinder the models' application in dynamic real-world environments, where data is rarely static or predictable. The research suggests that simply scaling up data, models, or computational power may not address these fundamental reasoning limitations. For AI to progress beyond sophisticated pattern recognition, new approaches are necessary. This insight is crucial for developing models that can achieve true logical reasoning, a capability vital for their effective deployment across various fields. 
The ability to reason accurately and consistently is essential for AI applications in critical areas such as education, healthcare, and decision-making systems. Understanding the limitations of LLMs' reasoning capabilities is crucial for ensuring AI safety and alignment with human values. Without addressing these issues, the deployment of AI in sensitive domains could lead to unreliable or potentially harmful outcomes. Apple's study serves as a call to action for innovative strategies to enhance reasoning capabilities in AI models. Identifying and addressing these limitations is essential for advancing towards more sophisticated AI systems, including the long-term goal of Artificial General Intelligence (AGI). By focusing on these challenges, researchers and developers can contribute to the creation of AI systems that are not only more intelligent but also more reliable and aligned with human needs and ethical considerations. As AI continues to evolve, understanding and overcoming these reasoning limitations will be crucial in shaping the future of intelligent systems. This research from Apple not only highlights current shortcomings but also opens new avenues for innovation in AI development, potentially leading to more capable, reliable, and truly intelligent AI systems in the future.
[5]
Reasoning failures highlighted by Apple research on LLMs
Apple plans to introduce its own version of AI starting with iOS 18.1 - image credit Apple
A new paper from Apple's artificial intelligence scientists has found that engines based on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills. The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models. The group investigated the "fragility" of mathematical reasoning by adding contextual information to their queries that a human could understand, but which should not affect the fundamental mathematics of the solution. This resulted in varying answers, which shouldn't happen. "Specifically, the performance of all models declines [even] when only the numerical values in the question are altered in the GSM-Symbolic benchmark," the group wrote in their report. "Furthermore, the fragility of mathematical reasoning in these models [demonstrates] that their performance significantly deteriorates as the number of clauses in a question increases." The study found that adding even a single sentence that appears to offer relevant information to a given math question can reduce the accuracy of the final answer by up to 65 percent. "There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer," the study concluded. A particular example that illustrates the issue was a math problem that required genuine understanding of the question. The task the team developed, called "GSM-NoOp," was similar to the kind of mathematical "word problems" an elementary student might encounter. The query started with the information needed to formulate a result. "Oliver picks 44 kiwis on Friday. 
Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday." The query then adds a clause that appears relevant but actually isn't with regard to the final answer, noting that of the kiwis picked on Sunday, "five of them were a bit smaller than average." The query then simply asks, "how many kiwis does Oliver have?" The note about the size of some of the kiwis picked on Sunday should have no bearing on the total number of kiwis picked. However, OpenAI's model as well as Meta's Llama3-8b subtracted the five smaller kiwis from the total result. The faulty logic echoes a previous study from 2019, which could reliably confuse AI models by asking a question about the age of two previous Super Bowl quarterbacks. By adding in background and related information about the games they played in, and a third person who was quarterback in another bowl game, the models produced incorrect answers. "We found no evidence of formal reasoning in language models," the new study concluded. The behavior of LLMs "is better explained by sophisticated pattern matching," which the study found to be "so fragile, in fact, that [simply] changing names can alter results."
[6]
Apple Researchers Suggest 'Fragile' AI Reasoning Capabilities Are Overstated
AI models rely on "sophisticated pattern matching more than true logical reasoning," the researchers concluded. According to commonly used benchmarks, frontier large language models (LLMs) have now surpassed the average human's ability to solve mathematical problems and perform complex reasoning. For instance, OpenAI's o1 model recently outperformed human experts on PhD-level science questions. However, a group of Apple researchers (Mirzadeh et al.) have recently highlighted a major flaw in the way AI performance is assessed. By changing the phrasing of the questions just a tiny bit, leading models from OpenAI, Google, Anthropic and Meta saw their ability to answer questions correctly collapse.
The Limitations of AI Benchmarks
Standardized AI benchmarks make it possible to compare different models' performance. However, if AI developers only measure intelligence using a limited set of benchmarks, they risk creating models that perform exceedingly well on a finite set of predetermined tasks but flounder in the wild. To explore the issue, Mirzadeh et al. modified the commonly used GSM8K benchmark -- a set of 8,500 grade school math word problems. The researchers found that even superficial changes such as switching names negatively impacted model performance. When they changed the values, performance dropped more notably. The most significant decrease occurred when they rephrased the question entirely. For example, adding a single irrelevant clause caused performance to decline by up to 65%. Interestingly, the researchers observed this "fragility of mathematical reasoning" across all models they tested, including so-called chain-of-thought (CoT) models like OpenAI's o1 that are meant to be capable of complex reasoning.
The Rise of Chain-of-Thought
Chain-of-thought first emerged as a form of prompt engineering that breaks down complex prompts into a series of intermediate steps. 
Although the technique was honed as an additional stage developers could apply to LLM prompts, some models now incorporate CoT into their architecture. With CoT baked in, OpenAI's o1 is much more capable of complex reasoning than its predecessors. The model's lead developer Lukasz Kaiser has argued that the new design approach represents a shift for LLMs that will lead to more concrete logical processes. Yet, for all its apparent advancements, o1 was subject to the same fragile reasoning the Apple researchers observed in other models.
AI Still Incapable of Formal Reasoning
Despite major performance gains, the researchers concluded that even the most sophisticated LLM operations "resemble sophisticated pattern matching more than true logical reasoning." Nevertheless, their findings do suggest that CoT-based approaches are moving in the right direction. Of all the models assessed, o1 experienced the smallest performance decline between the regular GSM8K questions and the modified ones. In other words, although its reasoning was found to be fragile, it is less fragile than that of other models.
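Chain-of-thought prompting, in its original form as prompt engineering, can be sketched as a few-shot prompt where a worked example shows the model how to emit intermediate steps before a final answer. The example problems and wording below are illustrative, not taken from the paper.

```python
# A minimal sketch of chain-of-thought prompting: the first Q/A pair
# demonstrates step-by-step working, and the model is asked to continue
# the second answer in the same style.
cot_prompt = (
    "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many balls?\n"
    "A: Roger starts with 5. 2 cans x 3 balls = 6 balls. 5 + 6 = 11. "
    "The answer is 11.\n"
    "Q: Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "How many kiwis?\n"
    "A:"  # the model is expected to continue with its own step-by-step reasoning
)
```

Models like o1 bake this behavior into training rather than relying on the prompt, but as the Apple results show, either way the underlying fragility remains.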
[7]
Apple Study Reveals Critical Flaws in AI's Logical Reasoning Abilities
Apple's AI research team has uncovered significant weaknesses in the reasoning abilities of large language models, according to a newly published study. The study, published on arXiv, outlines Apple's evaluation of a range of leading language models, including those from OpenAI, Meta, and other prominent developers, to determine how well these models could handle mathematical reasoning tasks. The findings reveal that even slight changes in the phrasing of questions can cause major discrepancies in model performance that can undermine their reliability in scenarios requiring logical consistency. Apple draws attention to a persistent problem in language models: their reliance on pattern matching rather than genuine logical reasoning. In several tests, the researchers demonstrated that adding irrelevant information to a question -- details that should not affect the mathematical outcome -- can lead to vastly different answers from the models. One example given in the paper involves a simple math problem asking how many kiwis a person collected over several days. When irrelevant details about the size of some kiwis were introduced, models such as OpenAI's o1 and Meta's Llama incorrectly adjusted the final total, despite the extra information having no bearing on the solution. "We found no evidence of formal reasoning in language models," the researchers wrote. "Their behavior is better explained by sophisticated pattern matching -- so fragile, in fact, that changing names can alter results by ~10%." This fragility in reasoning prompted the researchers to conclude that the models do not use real logic to solve problems but instead rely on sophisticated pattern recognition learned during training. They found that "simply changing names can alter results," a potentially troubling sign for the future of AI applications that require consistent, accurate reasoning in real-world contexts. 
According to the study, all models tested, from smaller open-source versions like Llama to proprietary models like OpenAI's GPT-4o, showed significant performance degradation when faced with seemingly inconsequential variations in the input data. Apple suggests that AI may need to combine neural networks with traditional, symbol-based reasoning -- an approach called neurosymbolic AI -- to achieve more accurate decision-making and problem-solving abilities.
[8]
LLMs can't perform "genuine logical reasoning," Apple researchers suggest
What is going on inside that anthropomorphized digital brain? Credit: Getty Images
For a while now, companies like OpenAI and Google have been touting advanced "reasoning" capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical "reasoning" displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems. The fragility highlighted in these new results helps support previous research suggesting that LLMs' use of probabilistic pattern matching lacks the formal understanding of underlying concepts needed for truly reliable mathematical reasoning capabilities. "Current LLMs are not capable of genuine logical reasoning," the researchers hypothesize based on these results. "Instead, they attempt to replicate the reasoning steps observed in their training data." In "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models" -- currently available as a pre-print paper -- the six Apple researchers start with GSM8K's standardized set of over 8,000 grade-school level mathematical word problems, which is often used as a benchmark for modern LLMs' complex reasoning capabilities. They then take the novel approach of modifying a portion of that testing set to dynamically replace certain names and numbers with new values -- so a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation. This approach helps avoid any potential "data contamination" that can result from the static GSM8K questions being fed directly into an AI model's training data. 
At the same time, these incidental changes don't alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should theoretically perform just as well when tested on GSM-Symbolic as GSM8K. Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names.
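The run-to-run variance described above is easy to quantify. The per-run accuracies below are invented for illustration; the study's point is that a single headline score hides a best-vs-worst gap that, per the paper, reached up to 15 points within one model.

```python
from statistics import mean

# Hypothetical per-run accuracies for one model across GSM-Symbolic runs
# that use different name/number substitutions. A model with genuine
# reasoning would score identically on every run.
run_accuracies = [0.84, 0.79, 0.88, 0.73, 0.81]

average = mean(run_accuracies)                      # the headline benchmark score
spread = max(run_accuracies) - min(run_accuracies)  # best-vs-worst gap
```

Reporting only the average (here 0.81) conceals a 15-point spread, which is exactly the kind of variance the researchers flag as evidence against formal reasoning.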
[9]
Apple Proves OpenAI o1 is Actually Good at Reasoning
While some say LLMs are our ticket to AGI, others think they're just glorified text-producing algorithms with a fancy name. Apple has gotten better at gaslighting AI companies that are spending all they have on making LLMs better at reasoning. A research team of six people at Apple recently published a paper titled - Understanding the Limitations of Mathematical Reasoning in Large Language Models - which basically said that the current LLMs can't reason. "...current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data," reads the paper, which also includes LLMs like OpenAI's GPT-4o and even the much-touted "thinking and reasoning" LLM, o1. The research was done on a series of other models as well, such as Llama, Phi, Gemma, and Mistral. Mehrdad Farajtabar, the senior author of the paper, posted on X explaining how the team came to the conclusion. According to him, LLMs just follow sophisticated patterns and even models smaller than 3 billion parameters are hitting benchmarks that only larger ones could do earlier, specifically the GSM8K score released by OpenAI three years ago. The researchers introduced GSM-Symbolic, a new tool for testing mathematical reasoning within LLMs because GSM8K was not accurate enough and thus, not reliable for testing the reasoning abilities of LLMs. Surprisingly, on this benchmark, OpenAI's o1 demonstrated "strong performance on various reasoning and knowledge-based benchmarks" according to the researchers, but the capabilities dropped by 30% when the researchers introduced the GSM-NoOp experiment, which involved adding irrelevant information to the questions. This proves that the "reasoning" capabilities of OpenAI's models are definitely getting better, and maybe GPT-5 would be a lot better. Maybe it's just Apple's LLMs that don't reason well, but the team didn't test out Apple's model. 
Also, not everyone is happy with the research paper as it fails to even explain what "reasoning" actually means and just introduces a new benchmark for evaluating LLMs. "Overall, we found no evidence of formal reasoning in language models...their behaviour is better explained by sophisticated pattern matching -- so fragile, in fact, that changing names can alter results by ~10%!" Farajtabar further added that scaling these models would just result in 'better pattern machines' but not 'better reasoners'. Some people have been making this claim all along that LLMs cannot reason and they are an off road to AGI. Possibly, Apple has finally accepted this after trying out LLMs on their products and this is possibly also one of the reasons why it backed out of its investment in OpenAI. Most of the researchers have been praising the paper by Apple and believe that it is important that others also accept that LLMs cannot reason. Gary Marcus, a long-standing critic of LLMs, also shared several examples of LLMs not able to perform reasoning tasks such as calculation and being better at Chess. On the other hand, a problem with Apple's paper is that it has confused reasoning with computation. "Reasoning is knowing an algorithm to solve a problem, not solving all of it in your head," said Paras Chopra, an AI researcher, while explaining that most of the LLMs know the approach to solving a problem even though they end up with the wrong answer in the end. According to him, knowing the approach is good enough to check if the LLM is reasoning even if the answer is wrong. Discussions on Hacker News highlight that some of the questions that the Apple researchers asked LLMs were trying to do a "gotcha!" on them, as they included irrelevant information in questions, which LLMs would not be able to actively filter out. Reasoning is the progressive, iterative reduction of informational entropy in a knowledge domain. OpenAI's o1-preview does that better by introducing iteration. 
It's not perfect, but it does it.

Subbarao Kambhampati, a computer science and AI professor at ASU, agreed that some of the claims of LLMs being capable of reasoning are exaggerated. However, he said that LLMs require more tools to handle System 2 (deliberate reasoning) tasks, for which techniques like fine-tuning or chain-of-thought prompting are not adequate.

When OpenAI released o1, claiming that the model thinks and reasons, Clem Delangue, the CEO of Hugging Face, was not impressed. "Once again, an AI system is not 'thinking', it's 'processing', 'running predictions',... just like Google or computers do," said Delangue, arguing that OpenAI is painting a false picture of what its newest model can achieve. While some agreed, others argued that this is exactly how human brains work as well. "Once again, human minds aren't 'thinking' they are just executing a complex series of bio-chemical / bio-electrical computing operations at massive scale," replied Phillip Rhodes to Delangue.

To test reasoning, some people also ask LLMs how many Rs there are in the word 'strawberry', a test that makes little sense for measuring reasoning: LLMs can't count letters directly because they process text in chunks called "tokens" rather than as individual characters. Tests for reasoning have been problematic for LLMs ever since the models were created.

Everyone seems to have strong opinions on LLMs. Some are grounded in research, with experts such as Yann LeCun or Francois Chollet arguing that LLM limitations should be taken more seriously, while others simply follow the hype or criticise it. Some say LLMs are our ticket to AGI, while others think they're just glorified text-producing algorithms with a fancy name. Meanwhile, Andrej Karpathy recently said that the next-token-prediction technique these LLMs, or Transformers, use might be able to solve a lot of problems outside the realm where it is being applied right now.
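The tokenization point can be made concrete with a toy sketch in Python. The segmentation below is hypothetical, chosen purely for illustration; real subword tokenizers (e.g. BPE) produce different splits:

```python
# A word as the model "sees" it: opaque subword tokens, not letters.
# This segmentation is a made-up example, not a real tokenizer's output.
toy_tokens = ["str", "aw", "berry"]

word = "".join(toy_tokens)
print(word)                  # strawberry
print(word.count("r"))       # 3 -- trivial once you have raw characters

# But the model receives only opaque token IDs for these chunks,
# so counting letters means reasoning about spellings it never
# observed directly. The letters are split unevenly across tokens:
print([t.count("r") for t in toy_tokens])  # [1, 0, 2]
```

In other words, the "strawberry" test probes a quirk of the input encoding, not the model's ability to reason.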
While LLMs can appear to reason to some extent, when they are genuinely put to the test, they end up failing it.
[10]
Researchers question AI's 'reasoning' ability as models stumble on math problems with trivial changes
How do machine learning models do what they do? And are they really "thinking" or "reasoning" the way we understand those things? This is a philosophical question as much as a practical one, but a new paper making the rounds Friday suggests that the answer is, at least for now, a pretty clear "no."

A group of AI research scientists at Apple released their paper, "Understanding the limitations of mathematical reasoning in large language models," to general commentary Thursday. While the deeper concepts of symbolic learning and pattern reproduction are a bit in the weeds, the basic concept of their research is very easy to grasp. Let's say I asked you to solve a simple math problem like this one:

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. How many kiwis does Oliver have?

Obviously, the answer is 44 + 58 + (44 * 2) = 190. Though large language models are actually spotty on arithmetic, they can pretty reliably solve something like this. But what if I threw in a little random extra info, like this:

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

It's the same math problem, right? And of course even a grade-schooler would know that even a small kiwi is still a kiwi. But as it turns out, this extra data point confuses even state-of-the-art LLMs. Here's GPT-o1-mini's take:

... on Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday's kiwis) - 5 (smaller kiwis) = 83 kiwis

This is just a simple example out of hundreds of questions that the researchers lightly modified, but nearly all of which led to enormous drops in success rates for the models attempting them. Now, why should this be?
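The arithmetic itself is trivial, which is the point. A few lines of Python contrast the correct reading with the distracted one (the "wrong" calculation mirrors the subtraction in GPT-o1-mini's answer above):

```python
# The kiwi problem, computed correctly: the "smaller than average"
# clause is a no-op and changes nothing about the count.
friday, saturday = 44, 58
sunday = 2 * friday                      # double Friday's count
total = friday + saturday + sunday
print(total)                             # 190

# The distracted reading: subtract the five smaller kiwis
# as if they didn't count, the way the model did.
wrong_sunday = sunday - 5                # 88 - 5 = 83
print(friday + saturday + wrong_sunday)  # 185
```

A program, of course, only gets the right answer because a human already decided which clauses matter; the paper's claim is that LLMs fail precisely at that deciding step.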
Why would a model that understands the problem be thrown off so easily by a random, irrelevant detail? The researchers propose that this reliable mode of failure means the models don't really understand the problem at all. Their training data does allow them to respond with the correct answer in some situations, but as soon as the slightest actual "reasoning" is required, such as whether to count small kiwis, they start producing weird, unintuitive results.

As the researchers put it in their paper:

[W]e investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.

This observation is consistent with the other qualities often attributed to LLMs due to their facility with language. When, statistically, the phrase "I love you" is followed by "I love you, too," the LLM can easily repeat that -- but it doesn't mean it loves you. And although it can follow complex chains of reasoning it has been exposed to before, the fact that this chain can be broken by even superficial deviations suggests that it doesn't actually reason so much as replicate patterns it has observed in its training data.

Mehrdad Farajtabar, one of the co-authors, breaks down the paper very nicely in this thread on X. An OpenAI researcher, while commending Mirzadeh et al.'s work, objected to their conclusions, saying that correct results could likely be achieved in all these failure cases with a bit of prompt engineering.
Farajtabar (responding with the typical yet admirable friendliness researchers tend to employ) noted that while better prompting may work for simple deviations, the model may require exponentially more contextual data in order to counter complex distractions -- ones that, again, a child could trivially point out.

Does this mean that LLMs don't reason? Maybe. That they can't reason? No one knows. These are not well-defined concepts, and the questions tend to appear at the bleeding edge of AI research, where the state of the art changes on a daily basis. Perhaps LLMs "reason," but in a way we don't yet recognize or know how to control.

It makes for a fascinating frontier in research, but it's also a cautionary tale when it comes to how AI is being sold. Can AI really do the things its makers claim, and if it does, how? As AI becomes an everyday software tool, this kind of question is no longer academic.
A new study by Apple researchers challenges the perceived intelligence of large language models, showing significant drops in performance when faced with slightly altered problems.
A recent study conducted by Apple researchers has cast doubt on the perceived intelligence of large language models (LLMs), including those developed by tech giants like OpenAI, Google, and Meta. The research suggests that these AI models may be relying more on sophisticated pattern matching than on true logical reasoning [1].
The study focused on the widely used GSM8K (Grade School Math 8K) dataset, a benchmark consisting of over 8,000 grade-school level math word problems. While LLMs have shown impressive performance on this dataset, Apple researchers argue that this success may be due to data contamination rather than genuine problem-solving ability [2].
To address these concerns, Apple developed a new benchmark called GSM-Symbolic. This test retains the core reasoning elements of the GSM8K dataset but introduces changes such as different names, numbers, and complexity, along with irrelevant information [3].
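A minimal Python sketch can illustrate the symbolic-templating idea behind such benchmarks. This is only an illustration of the concept, not the paper's actual code; the template, names, and number ranges here are invented:

```python
import random

# Symbolic template: the logical structure stays fixed, while the
# surface details (name, quantities) are resampled for each variant.
TEMPLATE = ("{name} picks {a} kiwis on Friday and {b} kiwis on Saturday. "
            "On Sunday, {name} picks double Friday's amount. "
            "How many kiwis does {name} have?")

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate the template with fresh values and compute ground truth."""
    name = rng.choice(["Oliver", "Sophie", "Liam", "Maya"])
    a = rng.randint(10, 60)
    b = rng.randint(10, 60)
    question = TEMPLATE.format(name=name, a=a, b=b)
    answer = a + b + 2 * a  # ground truth tracks the resampled numbers
    return question, answer

# Each seed yields a distinct variant with a known correct answer,
# so a model that merely memorized one dataset's surface forms gets no help.
question, answer = make_variant(random.Random(0))
print(question)
print(answer)
```

Because the correct answer is recomputed from the sampled values, a performance drop on such variants (relative to the fixed original problems) points to pattern matching rather than reasoning.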
When tested on the GSM-Symbolic benchmark, every LLM, including advanced models like OpenAI's GPT-4 and Meta's Llama 3, experienced substantial decreases in performance.
These findings have significant implications for the AI industry.
The study aligns with statements from industry leaders like Sam Altman, CEO of OpenAI, who has referred to current LLMs as "incredibly dumb" despite their impressive outward appearance. This research emphasizes the importance of focusing on safer, smarter AI systems that can handle more than just pattern recognition [1].
As AI continues to evolve, addressing these reasoning limitations will be crucial in shaping the future of intelligent systems. The findings from Apple's study offer a sobering perspective on the current state of LLMs and highlight the need for further research and development to ensure that AI models move beyond surface-level intelligence and develop true logical reasoning skills.