Apple Study Challenges AI Reasoning Capabilities, Casting Doubt on AGI Claims

Reviewed by Nidhi Govil

22 Sources

[1]

Ars Technica

New Apple study challenges whether AI models truly "reason" through problems

In early June, Apple researchers released a study suggesting that simulated reasoning (SR) models, such as OpenAI's o1 and o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking, produce outputs consistent with pattern-matching from training data when faced with novel problems requiring systematic thinking. The researchers found similar results to an April study of problems from the United States of America Mathematical Olympiad (USAMO), which showed that these same models achieved low scores on novel mathematical proofs. The new study, titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," comes from a team at Apple led by Parshin Shojaee and Iman Mirzadeh, and it includes contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The researchers examined what they call "large reasoning models" (LRMs), which attempt to simulate a logical reasoning process by producing a deliberative text output sometimes called "chain-of-thought reasoning" that ostensibly assists with solving problems in a step-by-step fashion. To do that, they pitted the AI models against four classic puzzles -- Tower of Hanoi (moving disks between pegs), checker jumping (eliminating pieces), river crossing (transporting items with constraints), and blocks world (stacking blocks) -- scaling them from trivially easy (like one-disk Hanoi) to extremely complex (20-disk Hanoi requiring over a million moves). "Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy," the researchers write. In other words, today's tests only care if the model gets the right answer to math or coding problems that may already be in its training data -- they don't examine whether the model actually reasoned its way to that answer or simply pattern-matched from examples it had seen before.
Ultimately, the researchers found results consistent with the aforementioned USAMO research, showing that these same models achieved mostly under 5 percent on novel mathematical proofs, with only one model reaching 25 percent, and not a single perfect proof among nearly 200 attempts. Both research teams documented severe performance degradation on problems requiring extended systematic reasoning.

Known skeptics and new evidence

AI researcher Gary Marcus, who has long argued that neural networks struggle with out-of-distribution generalization, called the Apple results "pretty devastating to LLMs." While Marcus has been making similar arguments for years and is known for his AI skepticism, the new research provides fresh empirical support for his particular brand of criticism. "It is truly embarrassing that LLMs cannot reliably solve Hanoi," Marcus wrote, noting that AI researcher Herb Simon solved the puzzle in 1957 and many algorithmic solutions are available on the web. Marcus pointed out that even when researchers provided explicit algorithms for solving Tower of Hanoi, model performance did not improve -- a finding that study co-lead Iman Mirzadeh argued shows "their process is not logical and intelligent." The Apple team found that simulated reasoning models behave differently from "standard" models (like GPT-4o) depending on puzzle difficulty. On easy tasks, such as Tower of Hanoi with just a few disks, standard models actually won because reasoning models would "overthink" and generate long chains of thought that led to incorrect answers. On moderately difficult tasks, SR models' methodical approach gave them an edge. But on truly difficult tasks, including Tower of Hanoi with 10 or more disks, both types failed entirely, unable to complete the puzzles, no matter how much time they were given. The researchers also identified what they call a "counterintuitive scaling limit."
As problem complexity increases, simulated reasoning models initially generate more thinking tokens but then reduce their reasoning effort beyond a threshold, despite having adequate computational resources. The study also revealed puzzling inconsistencies in how models fail. Claude 3.7 Sonnet could perform up to 100 correct moves in Tower of Hanoi but failed after just five moves in a river crossing puzzle -- despite the latter requiring fewer total moves. This suggests the failures may be task-specific rather than purely computational.

Competing interpretations emerge

However, not all researchers agree with the interpretation that these results demonstrate fundamental reasoning limitations. University of Toronto economist Kevin A. Bryan argued on X that the observed limitations may reflect deliberate training constraints rather than inherent inabilities. "If you tell me to solve a problem that would take me an hour of pen and paper, but give me five minutes, I'll probably give you an approximate solution or a heuristic. This is exactly what foundation models with thinking are RL'd to do," Bryan wrote, suggesting that models are specifically trained through reinforcement learning (RL) to avoid excessive computation. Bryan suggests that unspecified industry benchmarks show "performance strictly increases as we increase in tokens used for inference, on ~every problem domain tried," but notes that deployed models intentionally limit this to prevent "overthinking" simple queries. This perspective suggests the Apple paper may be measuring engineered constraints rather than fundamental reasoning limits. Software engineer Sean Goedecke offered a similar critique of the Apple paper on his blog, noting that when faced with Tower of Hanoi requiring over 1,000 moves, DeepSeek-R1 "immediately decides 'generating all those moves manually is impossible,' because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails."
Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it. Other researchers also question whether these puzzle-based evaluations are even appropriate for LLMs. Independent AI researcher Simon Willison told Ars Technica in an interview that the Tower of Hanoi approach was "not exactly a sensible way to apply LLMs, with or without reasoning," and suggested the failures might simply reflect running out of tokens in the context window (the maximum amount of text an AI model can process) rather than reasoning deficits. He characterized the paper as potentially overblown research that gained attention primarily due to its "irresistible headline" about Apple claiming LLMs don't reason. The Apple researchers themselves caution against over-extrapolating the results of their study, acknowledging in their limitations section that "puzzle environments represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems." The paper also acknowledges that reasoning models show improvements in the "medium complexity" range and continue to demonstrate utility in some real-world applications.

Implications remain contested

Has the credibility of claims about AI reasoning models been completely destroyed by these two studies? Not necessarily. What these studies may suggest instead is that the kinds of extended context reasoning hacks used by SR models may not be a pathway to general intelligence, as some have hoped. In that case, the path to more robust reasoning capabilities may require fundamentally different approaches rather than refinements to current methods. As Willison noted above, the results of the Apple study have so far been explosive in the AI community. Generative AI is a controversial topic, with many people gravitating toward extreme positions in an ongoing ideological battle over the models' general utility.
Many proponents of generative AI have contested the Apple results, while critics have latched onto the study as a definitive knockout blow for LLM credibility. Apple's results, combined with the USAMO findings, seem to strengthen the case made by critics like Marcus that these systems rely on elaborate pattern-matching rather than the kind of systematic reasoning their marketing might suggest. To be fair, much of the generative AI space is so new that even its inventors do not yet fully understand how or why these techniques work. In the meantime, AI companies might build trust by tempering some claims about reasoning and intelligence breakthroughs. However, that doesn't mean these AI models are useless. Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people who use them, given an understanding of their drawbacks and confabulations. As Marcus concedes, "At least for the next decade, LLMs (with and without inference time 'reasoning') will continue [to] have their uses, especially for coding and brainstorming and writing."
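
The move counts cited in this article (over a million moves for 20 disks, and over 1,000 moves in Goedecke's example) are consistent with the standard minimum-move formula for Tower of Hanoi. The short derivation below is a textbook result, not something quoted from the Apple paper, and the symbol T(n) for the minimum number of moves with n disks is introduced here purely for illustration.

```latex
\begin{align*}
T(1) &= 1, \qquad T(n) = 2\,T(n-1) + 1 \\
\Rightarrow\quad T(n) &= 2^{n} - 1 \\
T(10) &= 2^{10} - 1 = 1023 \\
T(20) &= 2^{20} - 1 = 1{,}048{,}575
\end{align*}
```

The recurrence reflects moving the top n-1 disks aside, moving the largest disk once, and then moving the n-1 disks back on top of it. A ten-disk instance would account for a figure just over 1,000, although the article does not say which disk count Goedecke's example used.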

[2]

New Scientist

Is superintelligent AI just around the corner, or just a sci-fi dream?

Tech CEOs are promising increasingly outlandish visions of the 2030s, powered by "superintelligence", but the reality is that even the most advanced AI models can still struggle with simple puzzles If you take the leaders of artificial intelligence companies at their word, their products mean that the coming decade will be quite unlike any in human history: a golden era of "radical abundance", where high-energy physics is "solved" and we see the beginning of space colonisation. But researchers working with today's most powerful AI systems are finding a different reality, in which even the best models are failing to solve basic puzzles that most humans find trivial, while the promise of AI that can "reason" seems to be overblown. So, whom should you believe? Sam Altman and Demis Hassabis, the CEOs of OpenAI and Google DeepMind, respectively, have both made recent claims that powerful, world-altering AI systems are just around the corner. In a blog post, Altman writes that "the 2030s are likely going to be wildly different from any time that has come before", speculating that we might go "from a major materials science breakthrough one year to true high-bandwidth brain-computer interfaces the next year". Hassabis, in an interview with Wired, also said that in the 2030s, artificial general intelligence (AGI) will start to solve problems like "curing terrible diseases", leading to "much healthier and longer lifespans," as well as finding new energy sources. "If that all happens," said Hassabis in the interview, "then it should be an era of maximum human flourishing, where we travel to the stars and colonize the galaxy." This vision relies heavily on the assumption that large language models (LLMs) like ChatGPT get more capable the more training data and computer power we throw at them. This "scaling law" seems to have held true for the past few years, but there have been hints of it faltering. 
For example, OpenAI's recent GPT-4.5 model, which likely cost hundreds of millions of dollars to train, achieved only modest improvements over its predecessor GPT-4. And that cost is nothing compared with future spending, with reports suggesting that Meta is about to announce a $15 billion investment in an attempt to achieve "superintelligence". Money isn't the only attempted solution to this problem, however - AI firms have also turned to "reasoning" models, like OpenAI's o1, which was released last year. These models use more computing time and so take longer to produce a response, feeding their own outputs back into themselves. This iterative process has been labelled "chain-of-thought", in an effort to draw comparisons to the way a person might think through problems step by step. "There were legitimate reasons to be concerned about AI plateauing," Noam Brown at OpenAI told New Scientist last year, but o1 and models like it meant that the "scaling law" could continue, he argued. Yet recent research has found these reasoning models can stumble on even simple logic puzzles. For example, researchers at Apple tested Chinese AI company DeepSeek's reasoning models and Anthropic's Claude thinking models, which work like OpenAI's o1-family of models. The researchers found they have "limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles", the researchers wrote. The team tested the AI on several puzzles, such as a scenario in which a person has to transport items across a river in the fewest number of steps, and Tower of Hanoi, a game where you must move rings one by one between three poles without placing a larger ring on top of a smaller one. Though the models could solve the puzzles at their easiest settings, they struggled with increasing the number of rings or items to transport. 
While we would spend a longer time thinking about a more complex problem, the researchers found that the AI models used fewer "tokens" - chunks of information - as the complexity of the problems increased, suggesting that the "thinking" time the models displayed is an illusion. "The damaging part is that these are tasks easily solvable," says Artur Garcez at City, University of London. "We already knew 50 years ago how to use symbolic AI reasoning to solve these." It is possible that these newer systems can be fixed and improved to eventually be able to reason through complex problems, but this research shows it's unlikely to happen purely through increasing the size of the models or the computational resources given to them, says Garcez. It is also a reminder that these models still struggle to solve scenarios they haven't seen outside of their training data, says Nikos Aletras at the University of Sheffield. "They work quite well actually in many cases, like finding, collating information and then summarising it, but these models have been trained to do these kinds of things, and it appears magic, but it isn't - they have been trained to do this," says Aletras. "Now, I think the Apple research has found a blind spot." Meanwhile, other research is showing that increased "thinking" time can actually hurt an AI model's performance. Soumya Suvra Ghosal and his colleagues at the University of Maryland tested DeepSeek's models and found that longer "chain of thought" processes led to a decreased accuracy on tests of mathematical reasoning. For example, for one mathematical benchmark, they found that tripling the amount of tokens used by a model can increase its performance by about 5 per cent. But using 10 to 15 times as many tokens again dropped the benchmark score by around 17 per cent. In some cases, it appears the "chain of thought" output produced by an AI bears little relation to the eventual answer it provides. 
When testing DeepSeek's models on the ability to navigate simple mazes, Subbarao Kambhampati at Arizona State University and his colleagues found that even when the AI solved the problem, its "chain of thought" output contained mistakes that weren't reflected in the final solution. What's more, feeding the AI a meaningless "chain of thought" could actually produce better answers. "Our results challenge the prevailing assumption that intermediate tokens or 'chains of thought' can be semantically interpreted as the traces of internal reasoning of the AI models, and caution against anthropomorphising them that way," says Kambhampati. Indeed, all of the studies suggest that "thinking" or "reasoning" labels for these AI models are a misnomer, says Anna Rogers at the IT University of Copenhagen in Denmark. "For as long as I've been in this field, every popular technique I can think of has been first hyped up with some vague cognitively-sounding analogy, which [was] then eventually proved wrong." Andreas Vlachos at the University of Cambridge points out that LLMs still have clear applications in text generation and other tasks, but says the latest research suggests we may struggle to ever make them tackle the kind of complex problems Altman and Hassabis have promised will be solved in just a few years. "Fundamentally, there is a mismatch between what these models are trained to do, which is next-word prediction, as opposed to what we are trying to get them to do, which is to produce reasoning," says Vlachos. OpenAI disagrees, however. "Our work shows that reasoning methods like chain-of-thought can significantly improve performance on complex problems, and we're actively working to expand these capabilities through better training, evaluation, and model design," says a spokesperson. DeepSeek didn't respond to a request for comment.

[3]

Tom's Hardware

Apple says generative AI cannot think like a human - research paper pours cold water on reasoning models

Apple researchers have tested advanced AI reasoning models -- which are called large reasoning models (LRMs) -- in controlled puzzle environments and found that while they outperform 'standard' large language models (LLMs) on moderately complex tasks, both fail completely as complexity increases. The researchers from Apple, which is not exactly at the forefront of AI development, believe that the current LRMs and LLMs have fundamental limits in their ability to generalize reasoning, or rather to think the way humans do. Apple researchers studied how advanced AI models -- the Claude 3.7 Sonnet Thinking and DeepSeek-R1 LRMs -- handle increasingly complex problem-solving tasks. They moved beyond standard math and coding benchmarks and designed controlled puzzle environments, such as Tower of Hanoi and River Crossing, where they could precisely adjust problem complexity. Their goal was to evaluate not just final answers but also the internal reasoning processes of these models, comparing them to standard large language models under equal computational conditions. Through the puzzles, they aimed to uncover the true strengths and fundamental limits of AI reasoning. Apple researchers discovered that LRMs perform differently depending on problem complexity. On simple tasks, standard LLMs, without explicit reasoning mechanisms, were more accurate and efficient and delivered better results with fewer compute resources. However, as problem complexity increased to a moderate level, models equipped with structured reasoning, like Chain-of-Thought prompting, gained the advantage and outperformed their non-reasoning counterparts. When the complexity grew further, both types of models failed completely: their accuracy dropped to zero regardless of the available compute resources. (Keep in mind that the Claude 3.7 Sonnet Thinking and DeepSeek-R1 LRMs have limitations when it comes to their training.)
A deeper analysis of the reasoning traces revealed inefficiencies and unexpected behavior. Initially, reasoning models used longer thought sequences as problems became harder, but near the failure point, they surprisingly shortened their reasoning effort even when they had sufficient compute capacity left. Moreover, even when explicitly provided with correct algorithms, the models failed to reliably execute step-by-step instructions on complex tasks, exposing weaknesses in logical computation. The study also found that model performance varied significantly between familiar and less-common puzzles, suggesting that success often depended on training data familiarity rather than true generalizable reasoning skills.

[4]

The Register

Apple AI boffins pour cold water on reasoning models

If you are betting on AGI - artificial general intelligence, the point at which AI models rival human cognition - showing up next year, you may want to adjust your timeline. Apple AI researchers have found that the "thinking" ability of so-called "large reasoning models" collapses when things get complicated. The authors' findings, described in a paper titled, "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," indicate that the intellectual potential of such models is so far quite limited. Large reasoning models (LRMs), such as OpenAI's o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, are designed to break problems down into smaller steps. Instead of responding to a prompt with a specific prediction, they use mechanisms like Chain of Thought to iterate through a series of steps, validating their intermediate answers along the way, to arrive at a solution to the stated problem. Authors Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar set out to test how these reasoning models perform. So they designed a puzzle environment for the models as an alternative to applying standard benchmark tests. The puzzle regime gave the researchers control over the complexity of the challenges while avoiding benchmark data contamination, a problem that arises when language models inadvertently absorb evaluation benchmarks during training, skewing their performance in testing. Some model makers have also been accused of gaming benchmarks, which just aren't all that great to begin with. The puzzle environment included various games like the Tower of Hanoi, in which the goal is to stack a set of differently sized disks in order of size by moving them one at a time between three upright pegs. The researchers found reasoning models did better with moderately complex problems, but broke down at a certain level of complexity. 
"[D]espite their sophisticated self-reflection mechanisms learned through reinforcement learning, these models fail to develop generalizable problem-solving capabilities for planning tasks, with performance collapsing to zero beyond a certain complexity threshold," the paper says. Reasoning models also underperformed simple large language models on easier problems - they often found the correct solution early but kept looking, inefficiently burning compute on unnecessary steps. The authors argue that the results suggest large reasoning models may not provide a path toward better artificial thinking. "These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning," the authors conclude. ®

[5]

Live Science

AI reasoning models aren't as smart as they were cracked up to be, Apple study claims

Artificial intelligence (AI) reasoning models aren't as smart as they've been made out to be. In fact, they don't actually reason at all, researchers at Apple say. Reasoning models, such as Anthropic's Claude, OpenAI's o3 and DeepSeek's R1, are specialized large language models (LLMs) that dedicate more time and computing power to produce more accurate responses than their traditional predecessors. The rise of these models has led to renewed claims from big tech firms that they could be on the verge of developing machines with artificial general intelligence (AGI) -- systems that outperform humans at most tasks. Yet a new study, published June 7 on Apple's Machine Learning Research website, has responded by landing a major blow against the company's competitors. Reasoning models don't just fail to show generalized reasoning, the scientists say in the study; their accuracy completely collapses when tasks get too complex. "Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities," the researchers wrote in the study. "Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget." LLMs grow and learn by absorbing training data from vast quantities of human output. Drawing upon this data enables models to generate probabilistic patterns from their neural networks by feeding them forward when given a prompt. Reasoning models are an attempt to further boost AI's accuracy using a process known as "chain-of-thought." It works by tracing patterns through this data using multi-step responses, mimicking how humans might deploy logic to arrive at a conclusion.
This gives the chatbots the ability to reevaluate their reasoning, enabling them to tackle more complex tasks with greater accuracy. During the chain-of-thought process, models spell out their logic in plain language for every step they take so that their actions can be easily observed. However, as this process is rooted in statistical guesswork instead of any real understanding, chatbots have a marked tendency to 'hallucinate' -- throwing out erroneous responses, lying when their data doesn't have the answers, and dispensing bizarre and occasionally harmful advice to users. An OpenAI technical report has highlighted that reasoning models are much more likely to be derailed by hallucinations than their generic counterparts, with the problem only getting worse as models advance. When tasked with summarizing facts about people, the company's o3 and o4-mini models produced erroneous information 33% and 48% of the time, respectively, compared to the 16% hallucination rate of its earlier o1 model. OpenAI representatives said they don't know why this is happening, concluding that "more research is needed to understand the cause of these results." "We believe the lack of systematic analyses investigating these questions is due to limitations in current evaluation paradigms," the authors wrote in Apple's new study. "Existing evaluations predominantly focus on established mathematical and coding benchmarks, which, while valuable, often suffer from data contamination issues and do not allow for controlled experimental conditions across different settings and complexities. Moreover, these evaluations do not provide insights into the structure and quality of reasoning traces." 
To delve deeper into these issues, the authors of the new study set generic and reasoning bots -- which include OpenAI's o1 and o3 models, DeepSeek R1, Anthropic's Claude 3.7 Sonnet, Google's Gemini -- four classic puzzles to solve (river crossing, checker jumping, block-stacking, and The Tower of Hanoi). They were then able to adjust the puzzles' complexity between low, medium and high by adding more pieces to them. For the low-complexity tasks, the researchers found that generic models had the edge on their reasoning counterparts, solving problems without the additional computational costs introduced by reasoning chains. As tasks became more complex, the reasoning models gained an advantage, but this didn't last when faced with highly complex puzzles, as the performance of both models "collapsed to zero." Upon passing a critical threshold, reasoning models reduced the tokens (the fundamental building blocks models break data down into) they assigned to more complex tasks, suggesting that they were reasoning less and had fundamental limitations in maintaining chains-of-thought. And the models continued to hit these snags even when given solutions. "When we provided the solution algorithm for the Tower of Hanoi to the models, their performance on this puzzle did not improve," the authors wrote in the study. "Moreover, investigating the first failure move of the models revealed surprising behaviours. For instance, they could perform up to 100 correct moves in the Tower of Hanoi but fail to provide more than 5 correct moves in the River Crossing puzzle." The findings point to models relying more heavily on pattern recognition, and less on emergent logic, than those who herald imminent machine intelligence claim. But the researchers do highlight key limitations to their study, including that the problems only represent a "narrow slice" of the potential reasoning tasks that the models could be assigned. Apple also has a lagging horse in the AI race. 
The company is trailing its rivals, with one analysis finding Siri to be 25% less accurate than ChatGPT at answering queries, and is instead prioritizing development of on-device, efficient AI over large reasoning models. This has inevitably led some to accuse Apple of sour grapes. "Apple's brilliant new AI strategy is to prove it doesn't exist," Pedro Domingos, a professor emeritus of computer science and engineering at the University of Washington, wrote jokingly on X. Nonetheless, some AI researchers have heralded the study as a necessary heaping of cold water on grandiose claims about our current AI's ability to one day become superintelligent. "Apple did more for AI than anyone else: they proved through peer-reviewed publications that LLMs are just neural networks and, as such, have all the limitations of other neural networks trained in a supervised way, which I and a few other voices tried to convey, but the noise from a bunch of AGI-feelers and their sycophants was too loud," Andriy Burkov, an AI expert and former machine learning team leader at research advisory firm Gartner, wrote on X. "Now, I hope, the scientists will return to do real science by studying LLMs as mathematicians study functions and not by talking to them as psychiatrists talk to sick people."

[6]

TechSpot

AI flunks logic test: Multiple studies reveal illusion of reasoning

Bottom line: More and more AI companies say their models can reason. Two recent studies say otherwise. When asked to show their logic, most models flub the task - proving they're not reasoning so much as rehashing patterns. The result: confident answers, but not intelligent ones. Apple researchers have uncovered a key weakness in today's most hyped AI systems - they falter at solving puzzles that require step-by-step reasoning. In a new paper, the team tested several leading models on the Tower of Hanoi, an age-old logic puzzle, and found that performance collapsed as complexity increased. The Tower of Hanoi puzzle is simple: move a stack of disks from one peg to another while following rules about order and disk size. For humans, it's a classic test of planning and recursive logic. For language models trained to predict the next token, the challenge lies in applying fixed constraints across multiple steps without losing track of the goal. Apple's researchers didn't just ask the models to solve the puzzle - they asked them to explain their steps. While most handled two or three disks, their logic unraveled as the disk count rose. Models misstated rules, contradicted earlier steps, or confidently made invalid moves - even with chain-of-thought prompts. In short, they weren't reasoning - they were guessing. These findings echo a study from April when researchers at ETH Zurich and INSAIT tested top AI models on problems from the 2025 USA Mathematical Olympiad - a competition requiring full written proofs. Out of nearly 200 attempts, none produced a perfect solution. One of the stronger performers, Google's Gemini 2.5 Pro, earned 24 percent of the total points - not by solving 24 percent of problems, but through partial credits on each attempt. OpenAI's o3-mini barely cleared 2 percent. The models didn't just miss answers - they made basic errors, skipped steps, and contradicted themselves while sounding confident. 
In one problem, a model started strong but excluded valid cases without explanation. Others invented constraints based on training quirks, such as always boxing final answers - even when it didn't fit the context. Gary Marcus, a longtime critic of AI hype, called Apple's findings "pretty devastating to large language models." "It is truly embarrassing that LLMs cannot reliably solve Hanoi," he wrote. "If you can't use a billion dollar AI system to solve a problem that Herb Simon one of the actual 'godfathers of AI,' solved with AI in 1957, and that first semester AI students solve routinely, the chances that models like Claude or o3 are going to reach AGI seem truly remote." Even when given explicit algorithms, model performance didn't improve. The study's co-lead Iman Mirzadeh put it bluntly: "Their process is not logical and intelligent." The results suggest what looks like reasoning is often just pattern matching - statistically fluent but not grounded in logic. Not all experts were dismissive. Sean Goedecke, a software engineer specializing in AI systems, saw the failure as revealing. "The model immediately decides 'generating all those moves manually is impossible,' because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails," he wrote in his analysis of the Apple study. "The key insight here is that past a certain complexity threshold, the model decides that there's too many steps to reason through and starts hunting for clever shortcuts. So past eight or nine disks, the skill being investigated silently changes from 'can the model reason through the Tower of Hanoi sequence?' to 'can the model come up with a generalized Tower of Hanoi solution that skips having to reason through the sequence?'" Rather than proving models are hopeless at reasoning, Goedecke suggested the findings highlight how AI systems adapt their behavior under pressure - sometimes cleverly, sometimes not. 
The failure isn't just in step-by-step reasoning but in abandoning the task when it becomes too unwieldy. Tech companies often highlight simulated reasoning as a breakthrough. The Apple paper confirms that even models fine-tuned for chain-of-thought reasoning tend to hit a wall once cognitive load grows - for example, when tracking moves beyond six disks in Tower of Hanoi. The models' internal logic unravels, with some only managing partial success by mimicking rational explanations. Few display a consistent grasp of cause and effect or goal-directed behavior. The results of the Apple and ETH Zurich studies stand in stark contrast to how companies market these models - as capable reasoners able to handle complex, multi-step tasks. In practice, what passes for reasoning is often just advanced autocomplete with extra steps. The illusion of intelligence arises from fluency and formatting, not true insight. The Apple paper stops short of proposing sweeping fixes. However, it aligns with growing calls for hybrid approaches that combine large language models with symbolic logic, verifiers, or task-specific constraints. These methods may not make AI truly intelligent, but they could help prevent confidently wrong answers from being presented as facts. Until such advances materialize, simulated reasoning is likely to remain what the name implies: simulated. It is useful - sometimes impressive - but far from genuine intelligence.

[7]

9to5Mac

Approaching WWDC, Apple researchers dispute claims that AI is capable of reasoning

While Apple has fallen behind the curve in terms of the AI features the company has actually launched, its researchers continue to work at the cutting edge of what's out there. In a new paper, they take issue with claims being made about some of the latest AI models - that they are actually capable of step-by-step reasoning. Apple says its tests show that this simply isn't true ... While it's acknowledged that conventional generative AI models, aka Large Language Models (LLMs), have no ability to reason, some AI companies are claiming that a new generation of models can. These are being referred to as Large Reasoning Models (LRMs). These grew out of attempts to have LLMs "show their work" - that is, lay out the individual steps taken to reach their conclusions. The idea is that if an AI can be forced to develop a chain of thought, and to take things one step at a time, that will stop it either making things up entirely or going off the rails at some point in its claims. Some big claims are being made for this approach, but a new Apple research paper calls this "the illusion of thinking." They argue that testing a range of LRMs shows that their "reasoning" quickly falls apart even with relatively simple logic challenges that are easy to solve algorithmically, like the Tower of Hanoi puzzle. Tower of Hanoi is a puzzle featuring three pegs and n disks of different sizes stacked on the first peg in size order (largest at bottom). The goal is to transfer all disks from the first peg to the third peg. Valid moves include moving only one disk at a time, taking only the top disk from a peg, and never placing a larger disk on top of a smaller one. You can create simpler or more complex versions of the game by varying the number of disks. What they found is that LRMs are actually worse than LLMs at the simplest versions of the puzzle, are slightly but not dramatically better when more disks are added - then fail completely with more than eight disks.
Simple problems (N=1-3) show early accuracy declining over time (overthinking), moderate problems (N=4-7) show slight improvement in accuracy with continued reasoning, and complex problems (N≥8) exhibit consistently near-zero accuracy, indicating complete reasoning failure, meaning that the model fails to generate any correct solutions within the thought. In fact, they demonstrated that LRMs fail even when you give them the algorithm needed to solve it! They say that these findings cast doubt on claims being made about the latest AI models. These insights challenge prevailing assumptions about LRM capabilities [...] Our findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds. New York University professor emeritus of psychology and neural science Gary Marcus - who has long argued that LRMs are incapable of reasoning - said that it shows that we need to move beyond the hope that making more and more capable LLMs will eventually result in intelligence.
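The three rules quoted above are simple enough to check mechanically. As a rough illustration only (this is not the Apple team's evaluation code, and the function name and the 0-2 peg numbering are assumptions made for this sketch), a move list could be validated in Python like this:

```python
def is_valid_solution(n, moves):
    """Replay `moves` on an n-disk Tower of Hanoi and report whether they solve it.

    Hypothetical helper for illustration. Pegs are numbered 0, 1, 2 and each
    move is a (source_peg, destination_peg) pair.
    """
    # Peg 0 starts with disks n (bottom) down to 1 (top); the others are empty.
    pegs = [list(range(n, 0, -1)), [], []]
    for src, dst in moves:
        # Both pegs must exist and differ, and the source must hold a disk.
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or src == dst:
            return False
        if not pegs[src]:
            return False
        disk = pegs[src][-1]  # only the top disk may be taken
        # A larger disk may never be placed on a smaller one.
        if pegs[dst] and pegs[dst][-1] < disk:
            return False
        pegs[dst].append(pegs[src].pop())
    # Solved only if every disk ends up on the third peg in size order.
    return pegs[2] == list(range(n, 0, -1))
```

A checker of this kind accepts or rejects a complete move list; it says nothing about how the list was produced, which is part of what the later debate over the study turns on.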

[8]

9to5Mac

New paper pushes back on Apple's LLM 'reasoning collapse' study - 9to5Mac

Apple's recent AI research paper, "The Illusion of Thinking", has been making waves for its blunt conclusion: even the most advanced Large Reasoning Models (LRMs) collapse on complex tasks. But not everyone agrees with that framing. Today, Alex Lawsen, a researcher at Open Philanthropy, published a detailed rebuttal arguing that many of Apple's most headline-grabbing findings boil down to experimental design flaws, not fundamental reasoning limits. The paper also credits Anthropic's Claude Opus model as its co-author. Lawsen's critique, aptly titled "The Illusion of the Illusion of Thinking," doesn't deny that today's LRMs struggle with complex planning puzzles. But he argues that Apple's paper confuses practical output constraints and flawed evaluation setups with actual reasoning failure. Here are the three main issues Lawsen raises: To back up his point, Lawsen reran a subset of the Tower of Hanoi tests using a different format: asking models to generate a recursive Lua function that prints the solution instead of exhaustively listing all moves. The result? Models like Claude, Gemini, and OpenAI's o3 had no trouble producing algorithmically correct solutions for 15-disk Hanoi problems, far beyond the complexity where Apple reported zero success. Lawsen's conclusion: When you remove artificial output constraints, LRMs seem perfectly capable of reasoning about high-complexity tasks. At least in terms of algorithm generation. At first glance, this might sound like typical AI research nitpicking. But the stakes here are bigger than that. The Apple paper has been widely cited as proof that today's LLMs fundamentally lack scalable reasoning ability, which, as I argued here, might not have been the fairest way to frame the study in the first place. 
Lawsen's rebuttal suggests the truth may be more nuanced: yes, LLMs struggle with long-form token enumeration under current deployment constraints, but their reasoning engines may not be as brittle as the original paper implies. Or better yet, as many said it implied. Of course, none of this lets LRMs off the hook. Even Lawsen acknowledges that true algorithmic generalization remains a challenge, and his re-tests are still preliminary. He also lays out suggestions as to what future works on the subject might want to focus on: The question isn't whether LRMs can reason, but whether our evaluations can distinguish reasoning from typing. In other words, his core point is clear: before we declare reasoning dead on arrival, it might be worth double-checking the standards by which that is being measured.
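For readers unfamiliar with the format Lawsen asked for, the textbook recursive solution is only a few lines long. The rebuttal used Lua; the sketch below is in Python instead, is not taken from the rebuttal or from any model's output, and uses an assumed function name and 0-2 peg numbering:

```python
def hanoi(n, src=0, dst=2, via=1):
    """Yield the moves that transfer n disks from peg `src` to peg `dst`.

    Standard recursive scheme, shown for illustration: move the top n-1 disks
    out of the way, move the largest disk, then move the n-1 disks back on top.
    """
    if n == 0:
        return
    yield from hanoi(n - 1, src, via, dst)   # clear the n-1 smaller disks
    yield (src, dst)                         # move the largest disk
    yield from hanoi(n - 1, via, dst, src)   # restack the smaller disks


# Usage: list the moves for a small case, or just count them for a large one.
small = list(hanoi(2))
count_15 = sum(1 for _ in hanoi(15))
```

The program stays the same size however many disks are involved, while the move list it generates doubles with each added disk. That gap between describing a solution and writing it out in full is the distinction the rebuttal draws.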

[9]

MacRumors

Apple Research Questions AI Reasoning Models Just Days Before WWDC

A newly published Apple Machine Learning Research study has challenged the prevailing narrative around AI "reasoning" large-language models like OpenAI's o1 and Claude's thinking variants, revealing fundamental limitations that suggest these systems aren't truly reasoning at all. For the study, rather than using standard math benchmarks that are prone to data contamination, Apple researchers designed controllable puzzle environments including Tower of Hanoi and River Crossing. This allowed a precise analysis of both the final answers and the internal reasoning traces across varying complexity levels, according to the researchers. The results are striking, to say the least. All tested reasoning models - including o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet - experienced complete accuracy collapse beyond certain complexity thresholds, and dropped to zero success rates despite having adequate computational resources. Counterintuitively, the models actually reduce their thinking effort as problems become more complex, suggesting fundamental scaling limitations rather than resource constraints. Perhaps most damning, even when researchers provided complete solution algorithms, the models still failed at the same complexity points. Researchers say this indicates the limitation isn't in problem-solving strategy, but in basic logical step execution. Models also showed puzzling inconsistencies - succeeding on problems requiring 100+ moves while failing on simpler puzzles needing only 11 moves. The research highlights three distinct performance regimes: standard models surprisingly outperform reasoning models at low complexity, reasoning models show advantages at medium complexity, and both approaches fail completely at high complexity. The researchers' analysis of reasoning traces showed inefficient "overthinking" patterns, where models found correct solutions early but wasted computational budget exploring incorrect alternatives. 
The take-home of Apple's findings is that current "reasoning" models rely on sophisticated pattern matching rather than genuine reasoning capabilities. It suggests that LLMs don't scale reasoning like humans do, overthinking easy problems and thinking less for harder ones. The timing of the publication is notable, having emerged just days before WWDC 2025, where Apple is expected to limit its focus on AI in favor of new software designs and features, according to Bloomberg.

[10]

Mashable

'The illusion of thinking': Apple research finds AI models collapse and give up with hard puzzles

New artificial intelligence research from Apple shows AI reasoning models may not be "thinking" so well after all. According to a paper published just days before Apple's WWDC event, large reasoning models (LRMs) -- like OpenAI o1 and o3, DeepSeek R1, Claude 3.7 Sonnet Thinking, and Google Gemini Flash Thinking -- completely collapse when they're faced with increasingly complex problems. The paper comes from the same researchers who found other reasoning flaws in LLMs last year. The news was a bucket of cold water for artificial general intelligence (AGI) optimists (and welcome news for AI and AGI skeptics), as Apple's research seemed to show damning evidence about the limitations of reasoning model intelligence. While the much-hyped LRMs performed better than LLMs on medium-difficulty puzzles, they performed worse on simple puzzles. And according to Apple's research, when they faced hard puzzles, they collapsed completely, giving up on the problem prematurely. Or, as the Apple researchers put it, while AI models perform extremely well at math and coding, when it comes to more complex problems, they only provide "The Illusion of Thinking." Apple was slow to develop large language models and implement AI in its devices, largely staying out of the conversation. The company has added Apple Intelligence AI features, though they have generally been considered underwhelming. With that in mind, this research might explain some of Apple's reticence to go all-in on AI, unlike Google and Samsung, which have frontloaded their devices with AI capabilities. The problems researchers used to evaluate the reasoning models, which they call LRMs or Large Reasoning Models, are classic logic puzzles like the Tower of Hanoi.
The puzzle consists of discs, stacked largest to smallest on one of three pegs, and the goal is to move the discs to the third peg without ever placing a larger disc on top of a smaller disc. Other puzzles included jumping checker pieces into empty spaces, the river-crossing problem (the one usually involving a fox, a chicken, and a bag of grain), and stacking blocks in a specific configuration. You probably recognize these logic puzzles from math class or online games, since they're a simple way of testing humans' ability to reason and problem-solve. Once you figure it out, it's a simple matter of following the logic even as the complexity increases, which in this case means more discs, checkers, animals, or blocks. However, researchers found that LRMs start to fail after a certain point. "Results show that all reasoning models exhibit a similar pattern with respect to complexity: accuracy progressively declines as problem complexity increases until reaching complete collapse (zero accuracy) beyond a model specific complexity threshold," researchers wrote. In the results shown, Claude 3.7 Sonnet + thinking and DeepSeek R1 start to fail when a fifth disc is added to the Tower of Hanoi problem. Even when more computing power is applied to the LRMs, they still fail at the more complex puzzles. What's more, researchers found that reasoning models initially apply more thinking tokens as complexity increases, but they actually give up at a certain point. "Upon approaching a critical threshold -- which closely corresponds to their accuracy collapse point -- models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty," the paper read. So when the problems get harder, they spend fewer tokens, or "think" less. But what about when the LRMs are given the answers? Nope, accuracy doesn't improve. Even when researchers included the algorithm in the prompt, so all the models need to do is follow the steps, they continued to fail.
But before you fire up the grill because LLM reasoning is so cooked, season these findings with a grain of salt. The research doesn't mean LRMs don't reason at all, it just means they may not currently be much smarter than humans. As AI expert Gary Marcus pointed out on his blog, "(ordinary) humans actually have a bunch of (well-known) limits that parallel what the Apple team discovered. Many (not all) humans screw up on versions of the Tower of Hanoi with 8 discs." As others have pointed out online, the research does not compare results from human attempts at these puzzles. Essentially, LLMs have their uses for tasks like coding and writing, but they also have weaknesses. "What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that LLMs are no substitute for good well-specified conventional algorithms," wrote Marcus, who has been very vocal about the reasoning limitations of AI models. That's to say, take the findings from Apple researchers for what they are: important data to be considered within the context of other LLM research. It's tempting to categorize AI's overall advancements as overhyped when new research like this comes out. Or, on the flip side, for AGI boosters to claim victory when research has discovered new advancements. But the reality is usually somewhere in the boring middle.

[11]

The Guardian

Cutting-edge AI models 'collapse' in face of complex problems, Apple study finds

'Pretty devastating' paper raises doubts about race to reach stage of AI at which systems match human intelligence Apple researchers have found "fundamental limitations" in cutting-edge artificial intelligence models, in a paper raising doubts about the technology industry's race to develop ever more powerful systems. Apple claimed in a paper published at the weekend that large reasoning models (LRMs) - an advanced form of AI - faced a "complete accuracy collapse" when presented with highly complex problems. It found that standard AI models outperformed LRMs in low-complexity tasks, while both types of model suffered "complete collapse" with high-complexity tasks. Large reasoning models attempt to solve complex queries by generating detailed thinking processes that break down the problem into smaller steps. The study, which tested the models' ability to solve puzzles, added that as LRMs neared performance collapse they began "reducing their reasoning effort". The Apple researchers said they found this "particularly concerning". Gary Marcus, a US academic who has become a prominent voice of caution on the capabilities of AI models, described the Apple paper as "pretty devastating". Marcus added that the findings raised questions about the race to artificial general intelligence (AGI), a theoretical stage of AI at which a system is able to match a human at carrying out any intellectual task. Referring to the large language models [LLMs] that underpin tools such as ChatGPT, Marcus wrote: "Anybody who thinks LLMs are a direct route to the sort [of] AGI that could fundamentally transform society for the good is kidding themselves." The paper also found that reasoning models wasted computing power by finding the right solution for simpler problems early in their "thinking". However, as problems became slightly more complex, models first explored incorrect solutions and arrived at the correct ones later. 
For higher-complexity problems, however, the models would enter "collapse", failing to generate any correct solutions. In one case, even when provided with an algorithm that would solve the problem, the models failed. The paper said: "Upon approaching a critical threshold - which closely corresponds to their accuracy collapse point - models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty." The Apple experts said this indicated a "fundamental scaling limitation in the thinking capabilities of current reasoning models". The paper set the LRMs puzzle challenges, such as solving the Tower of Hanoi and River Crossing puzzles. The researchers acknowledged that the focus on puzzles represented a limitation in its work. The paper concluded that the current approach to AI may have reached limitations. It tested models including OpenAI's o3, Google's Gemini Thinking, Anthropic's Claude 3.7 Sonnet-Thinking and DeepSeek-R1. Anthropic, Google and DeepSeek have been contacted for comment. OpenAI, the company behind ChatGPT, declined to comment. Referring to "generalizable reasoning" - or an AI model's ability to apply a narrow conclusion more broadly - the paper said: "These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning." Andrew Rogoyski, of the Institute for People-Centred AI at the University of Surrey, said the Apple paper signalled the industry was "still feeling its way" on AGI and that the industry could have reached a "cul de sac" in its current approach. "The finding that large reason models lose the plot on complex problems, while performing well on medium- and low-complexity problems implies that we're in a potential cul-de-sac in current approaches," he said.

[12]

The Guardian

When billion-dollar AIs break down over puzzles a child can do, it's time to rethink the hype | Gary Marcus

The tech world is reeling from a paper that shows the powers of a new generation of AI have been wildly oversold A research paper by Apple has taken the tech world by storm, all but eviscerating the popular notion that large language models (LLMs, and their newest variant, LRMs, large reasoning models) are able to reason reliably. Some are shocked by it, some are not. The well-known venture capitalist Josh Wolfe went so far as to post on X that "Apple [had] just GaryMarcus'd LLM reasoning ability" - coining a new verb (and a compliment to me), referring to "the act of critically exposing or debunking the overhyped capabilities of artificial intelligence ... by highlighting their limitations in reasoning, understanding, or general intelligence". Apple did this by showing that leading models such as ChatGPT, Claude and Deepseek may "look smart - but when complexity rises, they collapse". In short, these models are very good at a kind of pattern recognition, but often fail when they encounter novelty that forces them beyond the limits of their training, despite being, as the paper notes, "explicitly designed for reasoning tasks". As discussed later, there is a loose end that the paper doesn't tie up, but on the whole, its force is undeniable. So much so that LLM advocates are already partly conceding the blow while hinting at, or at least hoping for, happier futures ahead. In many ways the paper echoes and amplifies an argument that I have been making since 1998: neural networks of various kinds can generalise within a distribution of data they are exposed to, but their generalisations tend to break down beyond that distribution. A simple example of this is that I once trained an older model to solve a very basic mathematical equation using only even-numbered training data. The model was able to generalise a little bit: solve for even numbers it hadn't seen before, but unable to do so for problems where the answer was an odd number. 
More than a quarter of a century later, when a task is close to the training data, these systems work pretty well. But as they stray further away from that data, they often break down, as they did in the Apple paper's more stringent tests. Such limits arguably remain the single most important serious weakness in LLMs. The hope, as always, has been that "scaling" the models by making them bigger, would solve these problems. The new Apple paper resoundingly rebuts these hopes. They challenged some of the latest, greatest, most expensive models with classic puzzles, such as the Tower of Hanoi - and found that deep problems lingered. Combined with numerous hugely expensive failures in efforts to build GPT-5 level systems, this is very bad news. The Tower of Hanoi is a classic game with three pegs and multiple discs, in which you need to move all the discs on the left peg to the right peg, never stacking a larger disc on top of a smaller one. With practice, though, a bright (and patient) seven-year-old can do it. What Apple found was that leading generative models could barely do seven discs, getting less than 80% accuracy, and pretty much can't get scenarios with eight discs correct at all. It is truly embarrassing that LLMs cannot reliably solve Hanoi. And, as the paper's co-lead-author Iman Mirzadeh told me via DM, "it's not just about 'solving' the puzzle. We have an experiment where we give the solution algorithm to the model, and [the model still failed] ... based on what we observe from their thoughts, their process is not logical and intelligent". The new paper also echoes and amplifies several arguments that Arizona State University computer scientist Subbarao Kambhampati has been making about the newly popular LRMs. He has observed that people tend to anthropomorphise these systems, to assume they use something resembling "steps a human might take when solving a challenging problem". 
And he has previously shown that in fact they have the same kind of problem that Apple documents. If you can't use a billion-dollar AI system to solve a problem that Herb Simon (one of the actual godfathers of AI) solved with classical (but out of fashion) AI techniques in 1957, the chances that models such as Claude or o3 are going to reach artificial general intelligence (AGI) seem truly remote. So what's the loose thread that I warn you about? Well, humans aren't perfect either. On a puzzle like Hanoi, ordinary humans actually have a bunch of (well-known) limits that somewhat parallel what the Apple team discovered. Many (not all) humans screw up on versions of the Tower of Hanoi with eight discs. But look, that's why we invented computers, and for that matter calculators: to reliably compute solutions to large, tedious problems. AGI shouldn't be about perfectly replicating a human, it should be about combining the best of both worlds; human adaptiveness with computational brute force and reliability. We don't want an AGI that fails to "carry the one" in basic arithmetic just because sometimes humans do. Whenever people ask me why I actually like AI (contrary to the widespread myth that I am against it), and think that future forms of AI (though not necessarily generative AI systems such as LLMs) may ultimately be of great benefit to humanity, I point to the advances in science and technology we might make if we could combine the causal reasoning abilities of our best scientists with the sheer compute power of modern digital computers. What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that these LLMs that have generated so much hype are no substitute for good, well-specified conventional algorithms. (They also can't play chess as well as conventional algorithms, can't fold proteins like special-purpose neurosymbolic hybrids, can't run databases as well as conventional databases, etc.) 
What this means for business is that you can't simply drop o3 or Claude into some complex problem and expect them to work reliably. What it means for society is that we can never fully trust generative AI; its outputs are just too hit-or-miss. One of the most striking findings in the new paper was that an LLM may well work in an easy test set (such as Hanoi with four discs) and seduce you into thinking it has built a proper, generalisable solution when it has not. To be sure, LLMs will continue to have their uses, especially for coding and brainstorming and writing, with humans in the loop. But anybody who thinks LLMs are a direct route to the sort of AGI that could fundamentally transform society for the good is kidding themselves.

[13]

VentureBeat

Do reasoning models really "think" or not? Apple research sparks lively debate, response

Apple's machine-learning group set off a rhetorical firestorm earlier this month with its release of "The Illusion of Thinking," a 53-page research paper arguing that so-called large reasoning models (LRMs) or reasoning large language models (reasoning LLMs) such as OpenAI's "o" series and Google's Gemini-2.5 Pro and Flash Thinking don't actually engage in independent "thinking" or "reasoning" from generalized first principles learned from their training data. Instead, the authors contend, these reasoning LLMs are actually performing a kind of "pattern matching" and their apparent reasoning ability seems to fall apart once a task becomes too complex, suggesting that their architecture and performance is not a viable path to improving generative AI to the point that it is artificial generalized intelligence (AGI), which OpenAI defines as a model that outperforms humans at most economically valuable work, or superintelligence, AI even smarter than human beings can comprehend. Unsurprisingly, the paper immediately circulated widely among the machine learning community on X and many readers' initial reactions were to declare that Apple had effectively disproven much of the hype around this class of AI: "Apple just proved AI 'reasoning' models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all," declared Ruben Hassid, creator of EasyGen, an LLM-driven LinkedIn post auto writing tool. "They just memorize patterns really well."
But now today, a new paper has emerged, the cheekily titled "The Illusion of The Illusion of Thinking" -- importantly, co-authored by a reasoning LLM itself, Claude Opus 4 and Alex Lawsen, a human being and independent AI researcher and technical writer -- that includes many criticisms from the larger ML community about the paper and effectively argues that the methodologies and experimental designs the Apple Research team used in their initial work are fundamentally flawed. While we here at VentureBeat are not ML researchers ourselves and not prepared to say the Apple Researchers are wrong, the debate has certainly been a lively one and the issue about the capabilities of LRMs or reasoner LLMs compared to human thinking seems far from settled. How the Apple Research study was designed -- and what it found Using four classic planning problems -- Tower of Hanoi, Blocks World, River Crossing and Checkers Jumping -- Apple's researchers designed a battery of tasks that forced reasoning models to plan multiple moves ahead and generate complete solutions. These games were chosen for their long history in cognitive science and AI research and their ability to scale in complexity as more steps or constraints are added. Each puzzle required the models to not just produce a correct final answer, but to explain their thinking along the way using chain-of-thought prompting. As the puzzles increased in difficulty, the researchers observed a consistent drop in accuracy across multiple leading reasoning models. In the most complex tasks, performance plunged to zero. Notably, the length of the models' internal reasoning traces -- measured by the number of tokens spent thinking through the problem -- also began to shrink. Apple's researchers interpreted this as a sign that the models were abandoning problem-solving altogether once the tasks became too hard, essentially "giving up." 
The timing of the paper's release, just ahead of Apple's annual Worldwide Developers Conference (WWDC), added to the impact. It quickly went viral across X, where many interpreted the findings as a high-profile admission that current-generation LLMs are still glorified autocomplete engines, not general-purpose thinkers. This framing, while controversial, drove much of the initial discussion and debate that followed. Critics take aim on X Among the most vocal critics of the Apple paper was ML researcher and X user @scaling01 (aka "Lisan al Gaib"), who posted multiple threads dissecting the methodology. In one widely shared post, Lisan argued that the Apple team conflated token budget failures with reasoning failures, noting that "all models will have 0 accuracy with more than 13 disks simply because they cannot output that much!" For puzzles like Tower of Hanoi, he emphasized, the output size grows exponentially, while the LLM context windows remain fixed, writing "just because Tower of Hanoi requires exponentially more steps than the other ones, that only require quadratically or linearly more steps, doesn't mean Tower of Hanoi is more difficult" and convincingly showed that models like Claude 3 Sonnet and DeepSeek-R1 often produced algorithmically correct strategies in plain text or code -- yet were still marked wrong. Another post highlighted that even breaking the task down into smaller, decomposed steps worsened model performance -- not because the models failed to understand, but because they lacked memory of previous moves and strategy. "The LLM needs the history and a grand strategy," he wrote, suggesting the real problem was context-window size rather than reasoning. I raised another important grain of salt myself on X: Apple never benchmarked the model performance against human performance on the same tasks. "Am I missing it, or did you not compare LRMs to human perf[ormance] on [the] same tasks?? 
If not, how do you know this same drop-off in perf doesn't happen to people, too?" I asked the researchers directly in a thread tagging the paper's authors. I also emailed them about this and many other questions, but they have yet to respond. Others echoed that sentiment, noting that human problem solvers also falter on long, multistep logic puzzles, especially without pen-and-paper tools or memory aids. Without that baseline, Apple's claim of a fundamental "reasoning collapse" feels ungrounded. Several researchers also questioned the binary framing of the paper's title and thesis -- drawing a hard line between "pattern matching" and "reasoning." Alexander Doria aka Pierre-Carl Langlais, an LLM trainer at energy efficient French AI startup Pleias, said the framing misses the nuance, arguing that models might be learning partial heuristics rather than simply matching patterns. Ethan Mollick, the AI focused professor at University of Pennsylvania's Wharton School of Business, called the idea that LLMs are "hitting a wall" premature, likening it to similar claims about "model collapse" that didn't pan out. Meanwhile, critics like @arithmoquine were more cynical, suggesting that Apple -- behind the curve on LLMs compared to rivals like OpenAI and Google -- might be trying to lower expectations, coming up with research on "how it's all fake and gay and doesn't matter anyway," they quipped, pointing out Apple's reputation with now poorly performing AI products like Siri. In short, while Apple's study triggered a meaningful conversation about evaluation rigor, it also exposed a deep rift over how much trust to place in metrics when the test itself might be flawed. A measurement artifact, or a ceiling? In other words, the models may have understood the puzzles but ran out of "paper" to write the full solution. "Token limits, not logic, froze the models," wrote Carnegie Mellon researcher Rohan Paul in a widely shared thread summarizing the follow-up tests.
Yet not everyone is ready to clear LRMs of the charge. Some observers point out that Apple's study still revealed three performance regimes -- simple tasks where added reasoning hurts, mid-range puzzles where it helps, and high-complexity cases where both standard and "thinking" models crater. Others view the debate as corporate positioning, noting that Apple's own on-device "Apple Intelligence" models trail rivals on many public leaderboards.

The rebuttal: "The Illusion of the Illusion of Thinking"

In response to Apple's claims, a new paper titled "The Illusion of the Illusion of Thinking" was released on arXiv by independent researcher and technical writer Alex Lawsen of the nonprofit Open Philanthropy, in collaboration with Anthropic's Claude Opus 4. The paper directly challenges the original study's conclusion that LLMs fail due to an inherent inability to reason at scale. Instead, the rebuttal presents evidence that the observed performance collapse was largely a by-product of the test setup -- not a true limit of reasoning capability. Lawsen and Claude demonstrate that many of the failures in the Apple study stem from token limitations. For example, in tasks like Tower of Hanoi, the models must print exponentially many steps -- over 32,000 moves for just 15 disks -- leading them to hit output ceilings. The rebuttal points out that Apple's evaluation script penalized these token-overflow outputs as incorrect, even when the models followed a correct solution strategy internally. The authors also highlight several questionable task constructions in the Apple benchmarks. Some of the River Crossing puzzles, they note, are mathematically unsolvable as posed, and yet model outputs for these cases were still scored. This further calls into question the conclusion that accuracy failures represent cognitive limits rather than structural flaws in the experiments. 
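The move counts behind the token-limit argument follow from the standard result that an n-disk Tower of Hanoi needs at least 2^n - 1 moves. The Python sketch below works through that arithmetic. The figure of 10 tokens per printed move is a purely illustrative assumption, and the 64,000-token budget is the one reported elsewhere in this coverage; neither is taken from Apple's evaluation script or from the rebuttal's code, and the function names are hypothetical.

```python
def hanoi_move_count(disks: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with `disks` disks."""
    return 2 ** disks - 1


def largest_printable_puzzle(token_budget: int, tokens_per_move: int) -> int:
    """Largest disk count whose full move list fits inside `token_budget`.

    `tokens_per_move` is a hypothetical estimate of how many tokens a model
    spends printing one move; real tokenizers and output formats vary.
    """
    disks = 0
    while hanoi_move_count(disks + 1) * tokens_per_move <= token_budget:
        disks += 1
    return disks


if __name__ == "__main__":
    for n in (1, 10, 15, 20):
        print(f"{n:>2} disks -> {hanoi_move_count(n):,} moves")
    # Assumed figures: a 64,000-token budget and 10 tokens per printed move.
    print(largest_printable_puzzle(token_budget=64_000, tokens_per_move=10))
```

Under those assumed figures the full move list stops fitting after 12 disks, which is in the same range as the cutoff the critics describe. A different tokens-per-move estimate would shift the exact number.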
To test their theory, Lawsen and Claude ran new experiments allowing models to give compressed, programmatic answers. When asked to output a Lua function that could generate the Tower of Hanoi solution -- rather than writing every step line-by-line -- models suddenly succeeded on far more complex problems. This shift in format eliminated the collapse entirely, suggesting that the models didn't fail to reason. They simply failed to conform to an artificial and overly strict rubric.

Why it matters for enterprise decision-makers

The back-and-forth underscores a growing consensus: evaluation design is now as important as model design. Requiring LRMs to enumerate every step may test their printers more than their planners, while compressed formats, programmatic answers or external scratchpads give a cleaner read on actual reasoning ability. The episode also highlights practical limits developers face as they ship agentic systems -- context windows, output budgets and task formulation can make or break user-visible performance. For enterprise technical decision makers building applications atop reasoning LLMs, this debate is more than academic. It raises critical questions about where, when, and how to trust these models in production workflows -- especially when tasks involve long planning chains or require precise step-by-step output. If a model appears to "fail" on a complex prompt, the problem may not lie in its reasoning ability, but in how the task is framed, how much output is required, or how much memory the model has access to. This is particularly relevant for industries building tools like copilots, autonomous agents, or decision-support systems, where both interpretability and task complexity can be high. Understanding the constraints of context windows, token budgets, and the scoring rubrics used in evaluation is essential for reliable system design. 
Developers may need to consider hybrid solutions that externalize memory, chunk reasoning steps, or use compressed outputs like functions or code instead of full verbal explanations. Most importantly, the paper's controversy is a reminder that benchmarking and real-world application are not the same. Enterprise teams should be cautious of over-relying on synthetic benchmarks that don't reflect practical use cases -- or that inadvertently constrain the model's ability to demonstrate what it knows. Ultimately, the big takeaway for ML researchers is that before proclaiming an AI milestone -- or obituary -- make sure the test itself isn't putting the system in a box too small to think inside.
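To make the "compressed output" idea concrete, here is a minimal sketch of the kind of short program that can stand in for a full move list. It is written in Python rather than the Lua used in the rebuttal, and it is not the authors' code or Apple's grader: the names `hanoi_moves` and `is_valid_solution` and the peg labels "A", "B" and "C" are hypothetical choices made for illustration.

```python
from typing import Iterable, Iterator, Tuple

Move = Tuple[int, str, str]  # (disk, from_peg, to_peg)


def hanoi_moves(disks: int, source: str = "A", target: str = "C",
                spare: str = "B") -> Iterator[Move]:
    """Lazily yield the optimal move sequence for `disks` disks."""
    if disks == 0:
        return
    # Move the top disks-1 disks out of the way, onto the spare peg.
    yield from hanoi_moves(disks - 1, source, spare, target)
    # Move the largest remaining disk to its destination.
    yield (disks, source, target)
    # Bring the disks-1 smaller disks back on top of it.
    yield from hanoi_moves(disks - 1, spare, target, source)


def is_valid_solution(disks: int, moves: Iterable[Move]) -> bool:
    """Replay `moves` from the start state and check the rules and end state."""
    pegs = {"A": list(range(disks, 0, -1)), "B": [], "C": []}
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the named disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk would land on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(disks, 0, -1))


if __name__ == "__main__":
    print(list(hanoi_moves(2)))
    print(is_valid_solution(10, hanoi_moves(10)))
```

The generator is about a dozen lines no matter how many disks are requested, while the enumerated answer doubles in length with every added disk. A grader that runs the program and replays its moves checks the strategy without requiring the model to print each step.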

[14]

Futurism

Apple Researchers Just Released a Damning Paper That Pours Water on the Entire AI Industry

Researchers at Apple have released an eyebrow-raising paper that throws cold water on the "reasoning" capabilities of the latest, most powerful large language models. In the paper, a team of machine learning experts makes the case that the AI industry is grossly overstating the ability of its top AI models, including OpenAI's o3, Anthropic's Claude 3.7, and Google's Gemini. In particular, the researchers assail the claims of companies like OpenAI that their most advanced models can now "reason" -- a supposed capability that the Sam Altman-led company has increasingly leaned on over the past year for marketing purposes -- which the Apple team characterizes as merely an "illusion of thinking." It's a particularly noteworthy finding, considering Apple has been accused of falling far behind the competition in the AI space. The company has chosen a far more careful path to integrating the tech in its consumer-facing products -- with some seriously mixed results so far. In theory, reasoning models break down user prompts into pieces and use sequential "chain of thought" steps to arrive at their answers. But now, Apple's own top minds are questioning whether frontier AI models simply aren't as good at "thinking" as they're being made out to be. "While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood," the team wrote in its paper. The authors -- who include Samy Bengio, the director of Artificial Intelligence and Machine Learning Research at the software and hardware giant -- argue that the existing approach to benchmarking "often suffers from data contamination and does not provide insights into the reasoning traces' structure and quality." By using "controllable puzzle environments," the team estimated the AI models' ability to "think" -- and made a seemingly damning discovery. 
"Through extensive experimentation across diverse puzzles, we show that frontier [large reasoning models] face a complete accuracy collapse beyond certain complexities," they wrote. Thanks to a "counter-intuitive scaling limit," the AIs' reasoning effort "declines despite having an adequate token budget." Put simply, even with token budget to spare, the models are struggling with problems beyond a certain threshold of complexity -- the result of "an 'overthinking' phenomenon," in the paper's phrasing. The finding is reminiscent of a broader trend. Benchmarks have shown that the latest generation of reasoning models is more prone to hallucinating, not less, indicating the tech may now be heading in the wrong direction in a key way. Exactly how reasoning models choose which path to take remains surprisingly murky, the Apple researchers found. "We found that LRMs have limitations in exact computation," the team concluded in its paper. "They fail to use explicit algorithms and reason inconsistently across puzzles." The researchers claim their findings raise "crucial questions" about the current crop of AI models' "true reasoning capabilities," undercutting a much-hyped new avenue in the burgeoning industry. That's despite tens of billions of dollars being poured into the tech's development, with the likes of OpenAI, Google, and Meta constructing enormous data centers to run increasingly power-hungry AI models. Could the Apple researchers' finding be yet another canary in the coalmine, suggesting the tech has "hit a wall"? Or is the company trying to hedge its bets, calling out its outperforming competition as it lags behind, as some have suggested? It's certainly a surprising conclusion, considering Apple's precarious positioning in the AI industry: at the same time that its researchers are trashing the tech's current trajectory, it's promised a suite of Apple Intelligence tools for its devices like the iPhone and MacBook. 
"These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning," the paper reads.

[15]

Futurism

Frontier AI Models Are Getting Stumped by a Simple Children's Game

Earlier this week, researchers at Apple released a damning paper, criticizing the AI industry for vastly overstating the ability of its top AI models to reason or "think." The team found that models including OpenAI's o3, Anthropic's Claude 3.7, and Google's Gemini were stumped by even the simplest of puzzles. For instance, the "large reasoning models," or LRMs, consistently failed at Tower of Hanoi, a children's puzzle game that involves three pegs and a number of differently-sized disks that have to be arranged in a specific order. The researchers found that the AI models' accuracy in the game was less than 80 percent with seven disks, and that the models were more or less entirely stumped by puzzles involving eight disks. They also consistently failed at Blocks World, a block-stacking puzzle, and River Crossing, a puzzle that involves moving items across a river using a boat with several constraints. "Through extensive experimentation across diverse puzzles, we show that frontier [large reasoning models] face a complete accuracy collapse beyond certain complexities," the Apple researchers wrote. It was an eyebrow-raising finding, highlighting how even the most sophisticated of AI models are still failing to logic their way through simple puzzles, despite being made out to be something far more sophisticated by their makers' breathless marketing. Those approaches to selling the tech to the public have led to users anthropomorphizing AI models -- or thinking of them like humans -- leading to a major schism between their presumed and actual capabilities. The findings amplify ongoing fears that current AI approaches, including "reasoning" AI models that break down tasks into individual steps, are a dead end, despite billions of dollars being poured into their development. 
Worse yet, past a threshold of complexity, their shortcomings are becoming even more apparent, undercutting the AI industry's promises that simply scaling up the models' training data could make them more intelligent and capable of "reasoning." Noted AI critic Gary Marcus wasn't surprised by the researchers' findings. "In many ways, the paper echoes and amplifies an argument that I have been making since 1998," he wrote in a recent post on his Substack, referring to a paper he authored over 26 years ago. "Neural networks of various kinds can generalise within a distribution of data they are exposed to, but their generalisations tend to break down beyond that distribution." In short, getting stumped by simple children's games isn't exactly what you'd expect from AI models being sold as the next breakthrough in problem-solving and a step toward artificial general -- or superhuman -- intelligence (AGI), the stated goal of OpenAI. "It's not just about 'solving' the puzzle," colead author and Apple machine learning engineer Iman Mirzadeh told Marcus. "We have an experiment where we give the solution algorithm to the model, and [the model still failed]... based on what we observe from their thoughts, their process is not logical and intelligent." Marcus argues that large language and reasoning models simply cast far too wide a net and easily get lost as a result. "What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that these LLMs that have generated so much hype are no substitute for good, well-specified conventional algorithms," he wrote. "What this means for business is that you can't simply drop [OpenAI's LLM] o3 or Claude into some complex problem and expect them to work reliably," the critic added. "What it means for society is that we can never fully trust generative AI; its outputs are just too hit-or-miss." 
While many valid use cases for the models remain, "anybody who thinks LLMs are a direct route to the sort [of] AGI that could fundamentally transform society for the good is kidding themselves," Marcus concluded.

[16]

AIM

Apple Says Claude, DeepSeek-R1, and o3-mini Can't Really Reason

The researchers argue that traditional benchmarks, like math and coding tests, are flawed due to "data contamination" and fail to reveal how these models actually "think".

AI critic Gary Marcus is smiling again, thanks to Apple. In a new paper titled The Illusion of Thinking, researchers from the Cupertino-based company argue that even the most advanced AI models, including the so-called large reasoning models (LRMs), don't actually think. Instead, they simulate reasoning without truly understanding or solving complex problems. The paper, released just ahead of Apple's Worldwide Developers Conference, tested leading AI models, including OpenAI's o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, using specially designed algorithmic puzzle environments rather than standard benchmarks. The researchers argue that traditional benchmarks, like math and coding tests, are flawed due to "data contamination" and fail to reveal how these models actually "think". "We show that state-of-the-art LRMs still fail to develop generalisable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments," the paper noted. Interestingly, one of the authors of the paper is Samy Bengio, the brother of Turing Award winner Yoshua Bengio. Yoshua recently launched LawZero, a Canada-based nonprofit AI safety lab working on building systems that prioritise truthfulness, safety, and ethical behaviour over commercial interests. The lab has secured around $30 million in initial funding from prominent backers, including former Google CEO Eric Schmidt's philanthropic organisation, Skype co-founder Jaan Tallinn, Open Philanthropy, and the Future of Life Institute. Backing the paper's claims, Marcus could not contain his excitement. "AI is not hitting a wall. But LLMs probably are (or at least a point of diminishing returns). We need new approaches, and to diversify the which roads are being actively explored." 
"I don't think LLMs are a good way to get there (AGI). They might be part of the answer, but I don't think they are the whole answer," Marcus said in a previous interaction with AIM, stressing that LLMs are not "useless". He also expressed optimism about AGI, describing it as a machine capable of approaching new problems with the flexibility and resourcefulness of a smart human being. "I think we'll see it someday," he further said. Taking a more balanced view, Ethan Mollick, professor at The Wharton School, said in a post on X, "I think the Apple paper on the limits of reasoning models in particular tests is useful & important, but the "LLMs are hitting a wall" narrative on X around it feels premature at best. Reminds me of the buzz over model collapse -- limitations that were overcome quickly in practice." He added that the current approach to reasoning likely has real limitations for a variety of reasons. However, the reasoning approaches themselves were made public less than a year ago. "There are just a lot of approaches that might overcome these issues. Or they may not. It's just very early." Hemanth Mohapatra, partner at Lightspeed India, said that the recent Apple paper showing reasoning struggles with complex problems confirms what many experts, like Yann LeCun, have long sensed. He acknowledged that while a new direction is necessary, current AI capabilities still promise significant productivity gains. "We do need a different hill to climb, but that doesn't mean existing capabilities won't have huge impact on productivity," he said. Meanwhile, Subbarao Kambhampati, professor at Arizona State University, who has been pretty vocal about LLMs' inability to reason and think, quipped that another advantage of being a university researcher in AI is, "You don't have to deal with either the amplification or the backlash as a surrogate for 'The Company'. Your research is just your research, fwiw." 
Instead of relying on familiar benchmarks, Apple's team used controlled puzzle environments, such as variants of the Tower of Hanoi, to precisely manipulate problem complexity and observe how models generate step-by-step "reasoning traces". This allowed them to see not just the final answer, but the process the model used to get there. The paper found that for simpler problems, non-reasoning models often outperformed more advanced LRMs, which tended to "overthink" and miss the correct answer. As the difficulty level rose to moderate, the reasoning models showed their strength, successfully following more intricate logical steps. However, when faced with truly complex puzzles, all models, regardless of their architecture, struggled and ultimately failed. Rather than putting in more effort, the AI responses grew shorter and less thoughtful, as if the models were giving up. While large language models continue to struggle with complex reasoning, that doesn't make them useless. Abacus.AI CEO Bindu Reddy pointed out on X that many people are misinterpreting the paper as proof that LLMs don't work. "All this paper is saying is LLMs can't solve arbitrarily hard problems yet," she said, adding that they're already handling tasks beyond the capabilities of most humans. The researchers suggest that what appears to be reasoning is often just the retrieval and adaptation of memorised solution templates from training data, not genuine logical deduction. When confronted with unfamiliar and highly complex problems, the models' reasoning abilities tend to collapse almost immediately, exposing that apparent reasoning as an illusion of thought. The study makes it clear that current large language models are still far from being true general-purpose reasoners. Their ability to handle reasoning tasks does not extend beyond a certain level of complexity, and even targeted efforts to train them with the correct algorithms result in only minor improvements. 
Andrew White, co-founder of FutureHouse, questioned Apple's approach, saying that its AI researchers seem to have adopted an "anti-LLM cynic ethos" by repeatedly publishing papers that argue reasoning LLMs are fundamentally limited and lack generalisation ability. He pointed out the irony, saying Apple has "the worst AI products" like Siri and Apple Intelligence, and admitted he has no idea what their actual strategy is. Apple's research serves as a cautionary message for AI developers and users alike. While today's chatbots and reasoning models appear impressive, their core abilities remain limited. As the paper puts it, "despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds." "We need models that can represent and manipulate abstract structures, not just predict tokens. Hybrid systems that combine LLMs with symbolic logic, memory modules, or algorithmic planners are showing early promise. These aren't just add-ons -- they reshape how the system thinks," said Pradeep Sanyal, AI and data leader at a global tech consulting firm, in a LinkedIn post. He further added that combining neural and symbolic parts isn't without drawbacks. It introduces added complexity around coordination, latency, and debugging. But the improvements in precision and transparency make it a direction worth exploring.

[17]

Cointelegraph

AI models still far from AGI-level reasoning: Apple researchers

Current "thinking" AI models still can't reason to a level that would be expected from humanlike artificial general intelligence, the researchers found. The race to develop artificial general intelligence (AGI) still has a long way to run, according to Apple researchers who found that leading AI models still have trouble reasoning. Recent updates to leading AI large language models (LLMs) such as OpenAI's ChatGPT and Anthropic's Claude have included large reasoning models (LRMs), but their fundamental capabilities, scaling properties, and limitations "remain insufficiently understood," said the Apple researchers in a June paper called "The Illusion of Thinking." They noted that current evaluations primarily focus on established mathematical and coding benchmarks, "emphasizing final answer accuracy." However, this evaluation does not provide insights into the reasoning capabilities of the AI models, they said. The research contrasts with an expectation that artificial general intelligence is just a few years away. The researchers devised different puzzle games to test "thinking" and "non-thinking" variants of Claude Sonnet, OpenAI's o3-mini and o1, and DeepSeek-R1 and V3 chatbots beyond the standard mathematical benchmarks. They discovered that "frontier LRMs face a complete accuracy collapse beyond certain complexities," don't generalize reasoning effectively, and their edge disappears with rising complexity, contrary to expectations for AGI capabilities. "We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles." They found inconsistent and shallow reasoning with the models and also observed overthinking, with AI chatbots generating correct answers early and then wandering into incorrect reasoning. 
The researchers concluded that LRMs mimic reasoning patterns without truly internalizing or generalizing them, which falls short of AGI-level reasoning. "These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning." AGI is the holy grail of AI development, a state where the machine can think and reason like a human and is on a par with human intelligence. In January, OpenAI CEO Sam Altman said the firm was closer to building AGI than ever before. "We are now confident we know how to build AGI as we have traditionally understood it," he said at the time. In November, Anthropic CEO Dario Amodei said that AGI would exceed human capabilities in the next year or two. "If you just eyeball the rate at which these capabilities are increasing, it does make you think that we'll get there by 2026 or 2027," he said.

[18]

Dataconomy

Apple's quiet AI lab reveals how large models fake thinking

The latest generation of AI models, often called large reasoning models (LRMs), has dazzled the world with its ability to "think." Before giving an answer, these models produce long, detailed chains of thought, seemingly reasoning their way through complex problems. This has led many to believe we are on the cusp of true artificial general intelligence. But are these models really thinking? A new, insightful paper from researchers at Apple, titled "The Illusion of Thinking," puts this capability under a microscope and comes to some startling conclusions. By moving away from standard math tests -- which are often "contaminated" with answers the AI has already seen during training -- and into a controlled lab of complex puzzles, the researchers uncovered fundamental limits to AI reasoning. Today's most advanced AI isn't so much a brilliant thinker as it is an incredibly sophisticated pattern-matcher that quickly hits a wall when faced with truly new challenges. The researchers tested pairs of AI models -- one "thinking" LRM and its standard "non-thinking" counterpart -- on a series of puzzles like the Tower of Hanoi and River Crossing. By precisely increasing the difficulty, they discovered three distinct performance regimes: low-complexity tasks, where standard models match or outperform their "thinking" counterparts; medium-complexity tasks, where the extra reasoning gives LRMs an edge; and high-complexity tasks, where both kinds of model collapse. As the paper states, these models "fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities." Perhaps the most fascinating discovery is how the reasoning models fail. You would expect that as a problem gets harder, the AI would "think" more, using more of its computational budget. And it does -- but only up to a point. The research reveals a counterintuitive scaling limit. When a problem approaches the "collapse" point, the LRM starts to reduce its reasoning effort, spending fewer tokens on thinking despite the increasing difficulty. It's as if the model recognizes the task as too hard and simply gives up before it even starts, even with an adequate budget to keep trying. 
This suggests a fundamental limitation in their ability to scale their reasoning effort with a problem's difficulty. What if you made it even easier for the AI? What if you gave it the exact, step-by-step algorithm to solve the puzzle? Surely, a true reasoning machine could just follow the instructions. In a stunning finding, the researchers found this wasn't the case. "Even when we provide the algorithm in the prompt -- so that the model only needs to execute the prescribed steps -- performance does not improve, and the observed collapse still occurs at roughly the same point." This is the most damning evidence against the idea that these models "reason" in a human-like way. Their inability to execute a simple, explicit set of logical rules shows that their success relies more on recognizing familiar patterns than on genuine, symbolic manipulation. The model's inconsistent performance across different puzzle types further supports this, suggesting its ability is tied to the examples it has memorized from the web, not a general problem-solving skill.

[19]

Gadgets 360

Apple Researchers Find 'Accuracy Collapse' Problem in AI Reasoning Models

Claude 3.7 Sonnet and DeepSeek V3/R1 were chosen for this experiment

Apple published a research paper on Saturday, where researchers examine the strengths and weaknesses of recently released reasoning models. Also known as large reasoning models (LRMs), these are the models that "think" by utilising additional compute to solve complex problems. However, the paper found that even the most powerful models struggle with a complexity issue. Researchers said that when a problem is highly complex, the models experience a total collapse and give up on the problem instead of using more compute, which is something they're trained to do. In a paper titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," published on Apple's website, the researchers claim that LRMs and large language models (LLMs) without thinking capability behave differently across three regimes of complexity: low-complexity, medium-complexity, and high-complexity tasks. To test how LLMs and LRMs function when dealing with a wide range of complexities, the researchers decided to use several puzzles that can have an increasing level of difficulty. One puzzle in particular was the Tower of Hanoi. The Tower of Hanoi is a mathematical puzzle with three pegs and several disks. Disks are arranged in a decreasing order of size to create a pyramid-like shape. The objective of the puzzle is to shift the disks from the leftmost peg to the rightmost peg, while moving one disk at a time. There is a catch -- at no time should a larger disk be placed on top of a smaller disk. It is not a very difficult puzzle, and it is often targeted at children between the ages of six and 15. Apple researchers chose two reasoning models and their non-reasoning counterparts for this experiment. 
The LLMs chosen were Claude 3.7 Sonnet and DeepSeek-V3, while the LRMs were Claude 3.7 Sonnet with Thinking and DeepSeek-R1. The thinking budget was capped at 64,000 tokens each. The aim of the experiment was not just to check the final accuracy, but also the soundness of the logic behind each step chosen to solve the puzzle. In the low-complexity task, up to three disks were used, whereas for the medium-complexity task, the number of disks was kept between four and 10. Finally, in the high-complexity task, there were between 11 and 20 disks. The researchers noted that both LLMs and LRMs displayed equal aptitude in solving the low-complexity task. When the difficulty was increased, reasoning models were able to solve the puzzle more accurately, given the extra budget of compute. However, when the tasks reached the high-complexity zone, it was found that both types of model showed a complete collapse of reasoning. The same experiment was also said to be repeated with more models and more puzzles, such as Checkers Jumping, River Crossing, and Blocks World. Apple's research paper highlights the concerns that several others in the artificial intelligence (AI) space have already expressed. While reasoning models can generalise within the distribution of their training data, whenever a problem falls outside it, the models struggle in "thinking," and either try to take shortcuts in finding the solution, or completely give up and collapse. "Current evaluations primarily focus on established mathematical and coding benchmarks, emphasising final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces' structure and quality," the company said in a post.

[20]

Observer

Apple Research Finds 'Reasoning' A.I. Models Aren't Actually Reasoning

Apple researchers argue that what we often refer to as "reasoning" may, in fact, be little more than sophisticated pattern-matching. Just as the hype around artificial general intelligence (A.G.I.) reaches a fever pitch, Apple has delivered a sobering reality check to the industry. In a research paper titled "The Illusion of Thinking," published on June 6, the company argues that the most advanced A.I. models available today -- those billed as capable of "human-level reasoning" -- struggle with complex logic problems. Instead of genuinely thinking like humans, these models rely on pattern recognition: drawing from familiar cues in their training data and predicting the next step. When faced with unfamiliar or challenging tasks, the models either offer weak responses or fail entirely. In a controlled study, Apple researchers tested large language models (LLMs) such as Anthropic's Claude 3.7 Sonnet, DeepSeek-V3, and their "reasoning-optimized" versions (Claude 3.7 with Thinking and DeepSeek-R1). The team applied classic logic puzzles like the Tower of Hanoi and River Crossing -- well-established benchmarks for testing A.I. algorithms, planning and reasoning capabilities. The Tower of Hanoi tests recursion and step-by-step problem-solving, while River Crossing puzzles assess an A.I.'s ability to plan and execute multi-step solutions. Apple's researchers categorized the puzzles into three difficulty levels: low (3 steps), medium (4-10 steps) and high (11-20 steps). 
While most models handled the simpler tasks with reasonable success, their performance dropped dramatically as the puzzles grew more complex -- regardless of model size, training method or computational power. Even when given the correct algorithm or allowed to use up to 64,000 tokens -- a large computational budget -- the models offered only shallow responses, and performance did not improve even with explicit access to the solution algorithm. Through this study, Apple researchers argue that what we often refer to as "reasoning" may, in fact, be little more than sophisticated pattern-matching. They describe this phenomenon as a "counterintuitive scaling limit," where models, despite having ample computational resources, exert less effort as the complexity increases. "Current evaluations focus primarily on established mathematical and coding benchmarks, emphasizing final answer accuracy," Apple wrote in a blog post about the findings. "However, this paradigm often suffers from data contamination and fails to provide insights into the structure and quality of reasoning traces. Our setup allows analysis not only of the final answers but also of the internal reasoning traces, offering insights into how Large Reasoning Models (LRMs) 'think.'" This study introduces much-needed rigor to a field often dominated by marketing hype, especially at a time when tech giants are touting A.G.I. as just around the corner. It may also explain Apple's more cautious approach to A.I. development.

Apple reports its own A.I. progress at WWDC

The research paper was dropped days before Apple's annual WWDC developers conference, which kicked off today. In the opening keynote, Apple executives unveiled the Foundation Models framework. This framework will enable developers to integrate A.I. models into their apps, facilitating capabilities such as image generation, text creation and natural language search. 
Apple also introduced Xcode 26, a major update to its developer toolkit, which now includes built-in support for integrating A.I. models like ChatGPT and Claude via API keys. This update allows developers to leverage A.I. models for tasks like writing code, generating tests and documentation, and debugging. Together, these announcements mark a significant step in Apple's A.I. strategy, aiming to empower developers to build intelligent applications without relying on cloud infrastructure.

[21]

Apple Paper questions path to AGI, sparks division in GenAI group

New Delhi: A recent research paper from Apple focusing on the limitations of large reasoning models in artificial intelligence has left the generative AI community divided, sparking significant debate over whether the current path taken by AI companies towards artificial general intelligence is the right one. The paper, titled The Illusion of Thinking, published earlier this week, argues that even the most sophisticated large reasoning models do not genuinely think or reason in a human-like way. Instead, they excel at pattern recognition and mimicry, generating responses that only appear intelligent but lack true comprehension or conceptual understanding. The study used controlled puzzle environments, such as the popular Tower of Hanoi puzzle, to systematically test the reasoning abilities of large reasoning models such as OpenAI's o3 Mini, DeepSeek's R1, Anthropic's Claude 3.7 Sonnet and Google Gemini Flash across varying levels of complexity. The findings show that while large reasoning and language models may handle simple or moderately complex tasks, they experience total failure when faced with high-complexity problems, and that this failure occurs despite the models having sufficient computational resources. Gary Marcus, a cognitive scientist and a known sceptic of the claims surrounding large language models, views Apple's work as providing compelling empirical evidence that today's models primarily repeat patterns learned during training from vast datasets without genuine understanding or true reasoning capabilities. "If you can't use a billion-dollar AI system to solve a problem that Herb Simon (one of the actual godfathers of AI, current hype aside) solved with AI in 1957, and that first semester AI students solve routinely, the chances that models like Claude or o3 are going to reach AGI seem truly remote," Marcus wrote in his blog. 
Marcus' arguments echo earlier comments by Meta's chief AI scientist Yann LeCun, who has argued that current AI systems are mainly sophisticated pattern recognition tools rather than true thinkers. The release of Apple's paper ignited a polarised debate across the broader AI community, with many panning the design of the study rather than its findings. A published critique of the paper by researchers from Anthropic and San Francisco-based Open Philanthropy said the study's experimental design is flawed, notably because it overlooks the models' output limits. In an alternate demonstration, the researchers tested the models on the same problems but allowed them to use code, resulting in high accuracy across all the tested models. The criticism that the study ignores output limits and restricts the models' use of code has also been raised by other AI commentators and researchers, including Matthew Berman, a popular AI commentator and researcher. "SOTA models failed The Tower of Hanoi puzzle at a complexity threshold of >8 discs when using natural language alone to solve it. However, ask it to write code to solve it, and it flawlessly does up to seemingly unlimited complexity," Berman wrote in a post on X (formerly Twitter). The study highlights Apple's more cautious approach to AI compared to rivals like Google and Samsung, who have aggressively integrated AI into their products. Apple's research explains its hesitancy to fully commit to AI, contrasting with the industry's prevailing narrative of rapid progress. Many questioned the timing of the study's release, which coincided with Apple's annual WWDC event, where it announces its next software updates. Chatter across online forums said the study was more about managing expectations in light of Apple's own struggles with AI. That said, practitioners and business users argue that the findings do not change the immediate utility of AI tools for everyday applications.
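To illustrate the point made by Berman and the published critique, here is a minimal Python sketch of the textbook recursive Tower of Hanoi solution. It is not taken from Apple's paper or from the critique; the function names `hanoi_moves` and `min_moves` and the peg numbering are assumptions made for illustration. It shows why a short program can emit a correct move list at any size, while the list itself doubles in length with every added disk.

```python
from typing import Iterator, Tuple

# (disk, from_peg, to_peg); disks are numbered 1 (smallest) to n, pegs 0 to 2.
# This move format is an illustrative assumption, not the paper's format.
Move = Tuple[int, int, int]


def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> Iterator[Move]:
    """Yield the optimal move sequence that transfers n disks from src to dst."""
    if n == 0:
        return
    # Park the n-1 smaller disks on the spare peg.
    yield from hanoi_moves(n - 1, src, dst, aux)
    # Move the largest remaining disk to its destination.
    yield (n, src, dst)
    # Bring the n-1 smaller disks back on top of it.
    yield from hanoi_moves(n - 1, aux, src, dst)


def min_moves(n: int) -> int:
    """Length of the optimal solution for n disks: 2**n - 1."""
    return 2 ** n - 1


if __name__ == "__main__":
    for n in (3, 8, 10):
        count = sum(1 for _ in hanoi_moves(n))
        print(f"{n} disks: {count} moves")
    print(f"20 disks would need {min_moves(20)} moves")
```

The program stays a few lines long regardless of `n`, but writing the answer out move by move means producing 2**n - 1 entries, which is the output-length concern the critics raise.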

[22]

Digit

Apple research claims popular AI models fail at hard reasoning: Why does it matter?

Synthetic benchmarks, though valuable, overstate AI limitations by ignoring real-world tools

Over the weekend, Apple released new research that accuses most advanced generative AI models from the likes of OpenAI, Google and Anthropic of failing to handle tough logical reasoning problems. Apple's researchers claim to show that most large reasoning models (or LRMs) simply "give up" when tasked with hard puzzle-solving tasks, thereby exposing a major pitfall in GenAI's reasoning capabilities as they exist, for the most part, in the LLM-based chatbots we've all gotten used to over the past couple of years. In their recent paper, "The Illusion of Thinking," Apple's researchers pull back the curtain on how LRMs break down on difficult reasoning tasks. Basing their paper on GenAI's ability to solve certain algorithmic puzzles, Apple's researchers paint a stark picture - that these models start strong on easy and medium-difficulty problems, then simply "give up" once complexity crosses a threshold. But before we declare AI reasoning officially broken, it's worth asking - how much of this collapse reflects reality, and how much is an artifact of the puzzles themselves? Apple's critique of benchmarks like GSM8K and MATH starts with a valid point. The paper says that too often models memorize leaked test data, inflating our sense of their reasoning prowess. To combat this, Apple turned to four classic algorithmic puzzles - Tower of Hanoi, River Crossing, Blocks World, and Checker Jumping - each scalable in precise steps while holding the basic logic constant. This lets them track not just whether a model gets the right answer, but also the length and structure of its "chain-of-thought" token traces. 
By testing top-tier LRMs - OpenAI's o-series, Anthropic's Claude, Google's Gemini 2.5, among others - Apple researchers saw a consistent pattern. First, in low-complexity puzzles, vanilla next-token models sometimes outdo "reasoning" variants. Second, in medium-complexity puzzles, chain-of-thought prompting gives LRMs an edge. However, once puzzles reach high complexity, accuracy crashes to near zero, no matter how many tokens you throw at the problem. The most alarming result is what Apple calls the "accuracy cliff." Once a puzzle's compositional steps exceed a hidden breakpoint - which is unique to each AI model - success rates plummet instantly. Equally telling is what happens to the models' token-level reasoning traces. Rather than lengthening their chains to tackle harder steps, LRMs start shortening them - a clear "giving up" heuristic, as Apple frames it. This echoes earlier stress-test research on SAT solvers and arithmetic models, where sharp performance drops past a certain complexity have been reported - especially on mathematical problems. Independent code-generation studies have come to the same conclusion in the past - when faced with too many lines of logic, GenAI models produce plausible but incomplete code, leaving out important details in an effort to be concise and becoming less accurate as a result. Apple's puzzle approach effectively demonstrates that chain-of-thought prompting has clear scaling limitations, as performance gains diminish rapidly beyond moderate difficulty levels, while its use of focused synthetic tasks ensures benchmark integrity by avoiding the contamination issues that plague popular evaluation suites where models may have memorized training data. 
Perhaps most significantly, their token-trace analysis reveals that these models don't simply process slowly when facing complex problems - they actively reduce their own reasoning pathways when they detect they're heading toward a dead end, suggesting a more sophisticated but potentially limiting form of self-regulation in AI reasoning processes. Here's where the Apple paper risks overreach. Because whether you like it or not, algorithmic puzzles live in a vacuum, stripped of the rich context, domain heuristics, and external tools (think calculators, retrieval systems, or symbolic engines) that real-world tasks allow. Few of us solve problems solely by chaining logic in our heads these days - we Google, we scribble, we offload heavy lifting to spreadsheets or math libraries. Hybrid architectures - think of retrieval-augmented generation (RAG) models - can dynamically fetch facts or calculate precisely, which helps plug the gaps in general reasoning that Apple's puzzle evaluation exposes. By focusing narrowly on standalone LRMs, the Apple paper sidesteps these more robust systems, which are quickly becoming the norm. Apple's experiments reportedly use off-the-shelf models with minimal prompt optimization. But anyone who's ever used ChatGPT or the Gemini chatbot knows that careful prompt engineering or targeted follow-up commands can push output quality considerably higher, and that holds in benchmarks as well. In other words, the reasoning collapse that Apple describes might shift further up the complexity curve, rather than vanish entirely. Also, interpreting shorter reasoning chains as outright failure can be problematic. Models often prune redundant intermediate steps when they suspect a pattern, aiming for a concise but correct response. In such a scenario, token economy isn't necessarily a cry of defeat - it can also be a sign of increasing efficiency. 
We need finer metrics - perhaps measuring whether the pruned steps eliminate critical logical nodes or merely trim fluff - before diagnosing an exhaustion syndrome. All said and done, Apple's "The Illusion of Thinking" is a welcome reality check. It reminds us that shiny demos on sanitized benchmarks can lull us into overconfidence. The paper's controlled puzzles unveil genuine cracks in standalone LLM reasoning, and its trace analyses offer a compelling new window into model behavior. But it's important also to note that these puzzles are not the final word on AI's reasoning future. Real-world intelligence rarely reduces to pure logic chains, as retrieval skills, tools at hand, and human ingenuity all play a part when we're trying to get something done. If we want AI that truly "thinks," we must broaden our evaluation horizons, test hybrid systems in pragmatic scenarios, and refine our metrics to capture both depth and efficiency of reasoning.

Apple researchers find that advanced AI reasoning models struggle with complex problem-solving, suggesting fundamental limitations in their ability to generalize reasoning like humans do.

Apple Researchers Challenge AI Reasoning Capabilities

A new study from Apple researchers has cast doubt on the capabilities of advanced AI reasoning models, challenging claims about imminent artificial general intelligence (AGI). The research, titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," was conducted by a team led by Parshin Shojaee and Iman Mirzadeh [1].

Study Methodology and Findings

The researchers examined "large reasoning models" (LRMs), including OpenAI's o1 and o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking. These models attempt to simulate logical reasoning through a process called "chain-of-thought reasoning" [1]. The study used four classic puzzles - Tower of Hanoi, checkers jumping, river crossing, and blocks world - scaled from easy to extremely complex [1].

Key findings include:

On simple tasks, standard models outperformed reasoning models.
For moderately difficult tasks, reasoning models had an advantage.
On highly complex tasks, both types of models failed completely [1][3].

The researchers also observed a "counterintuitive scaling limit" where reasoning models initially generated more thinking tokens as problem complexity increased, but then reduced their reasoning effort beyond a certain threshold [1].
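To make the puzzle-based setup more concrete, the sketch below shows how a proposed Tower of Hanoi answer can be checked move by move rather than only by its final state. This is a hypothetical checker written for illustration, not the simulator used in the study; the `(from_peg, to_peg)` move format and the name `check_solution` are assumptions.

```python
from typing import List, Optional, Tuple

# Assumed move format: (from_peg, to_peg), with pegs numbered 0, 1 and 2.
Move = Tuple[int, int]


def check_solution(n_disks: int, moves: List[Move]) -> Tuple[bool, Optional[int]]:
    """Replay moves on a three-peg board.

    Returns (solved, first_bad_index). first_bad_index is the position of the
    first illegal move, or None when every move is legal.
    """
    # Peg 0 starts with all disks, largest (n_disks) at the bottom.
    pegs = [list(range(n_disks, 0, -1)), [], []]
    for i, (src, dst) in enumerate(moves):
        legal = (
            src in (0, 1, 2)
            and dst in (0, 1, 2)
            and src != dst
            and bool(pegs[src])  # there must be a disk to move
            # a disk may only land on an empty peg or on a larger disk
            and (not pegs[dst] or pegs[dst][-1] > pegs[src][-1])
        )
        if not legal:
            return False, i
        pegs[dst].append(pegs[src].pop())
    # Solved only if every disk ended up on the last peg.
    return len(pegs[2]) == n_disks, None


if __name__ == "__main__":
    print(check_solution(2, [(0, 1), (0, 2), (1, 2)]))  # legal and complete
    print(check_solution(2, [(0, 1), (0, 1)]))          # second move is illegal
```

Reporting the index of the first illegal move, and not just pass or fail, is one simple way to locate where in a long answer a model goes wrong.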

Implications for AI Development

These results align with a recent study by the United States of America Mathematical Olympiad (USAMO), which found that the same models achieved low scores on novel mathematical proofs [1]. Both studies documented severe performance degradation on problems requiring extended systematic reasoning.

AI researcher Gary Marcus, known for his skepticism, called the Apple results "pretty devastating to LLMs" [1]. The study provides empirical support for the argument that neural networks struggle with out-of-distribution generalization.

Competing Interpretations

Not all researchers agree with the interpretation that these results demonstrate fundamental reasoning limitations. Some argue that the observed limitations may reflect deliberate training constraints rather than inherent inabilities [1].

University of Toronto economist Kevin A. Bryan suggested that models are specifically trained through reinforcement learning to avoid excessive computation, which could explain the observed behavior [1]. Software engineer Sean Goedecke offered a similar critique, noting that when faced with extremely complex tasks, models like DeepSeek-R1 may decide that generating all moves manually is impossible and attempt to find shortcuts [1].

Broader Context and Industry Claims

The study's findings contrast sharply with recent claims by AI industry leaders. Sam Altman of OpenAI and Demis Hassabis of Google DeepMind have made bold predictions about AI capabilities in the 2030s, including solving high-energy physics problems and enabling space colonization [2].

However, researchers working with today's most advanced AI systems are finding a different reality. Even the best models are failing to solve basic puzzles that most humans find trivial, while the promise of AI that can "reason" seems to be overblown [2].

Limitations and Future Directions

The Apple researchers acknowledge that their study represents only a "narrow slice" of potential reasoning tasks [5]. However, their findings suggest that current approaches to AI development may be encountering fundamental barriers to generalizable reasoning [4].

As the AI industry continues to invest heavily in developing more advanced models, with reports of Meta planning a $15 billion investment to achieve "superintelligence" [2], these research findings highlight the need for a critical examination of AI capabilities and limitations. The gap between industry claims and research findings underscores the importance of continued rigorous testing and evaluation of AI systems as they evolve.

Summarized by Navi

References

[1]

Ars Technica

New Apple study challenges whether AI models truly "reason" through problems

[2]

New Scientist

Is superintelligent AI just around the corner, or just a sci-fi dream?

[3]

Tom's Hardware

Apple says generative AI cannot think like a human - research paper pours cold water on reasoning models

[4]

The Register

Apple AI boffins pour cold water on reasoning models

[5]

Live Science

AI reasoning models aren't as smart as they were cracked up to be, Apple study claims

