6 Sources
[1]
Approaching WWDC, Apple researchers dispute claims that AI is capable of reasoning
While Apple has fallen behind the curve in terms of the AI features the company has actually launched, its researchers continue to work at the cutting edge of what's out there. In a new paper, they take issue with claims being made about some of the latest AI models - that they are actually capable of step-by-step reasoning. Apple says its tests show that this simply isn't true ...

While it's acknowledged that conventional generative AI models, aka Large Language Models (LLMs), have no ability to reason, some AI companies are claiming that a new generation of models can. These are being referred to as Large Reasoning Models (LRMs). These grew out of attempts to have LLMs "show their work" - that is, lay out the individual steps taken to reach their conclusions. The idea is that if an AI can be forced to develop a chain of thought, and to take things one step at a time, that will stop it from either making things up entirely or going off the rails at some point in its claims.

Some big claims are being made for this approach, but a new Apple research paper calls this "the illusion of thinking." The researchers argue that testing a range of LRMs shows that their "reasoning" quickly falls apart even with relatively simple logic challenges that are easy to solve algorithmically, like the Tower of Hanoi puzzle.

Tower of Hanoi is a puzzle featuring three pegs and n disks of different sizes stacked on the first peg in size order (largest at bottom). The goal is to transfer all disks from the first peg to the third peg. Valid moves include moving only one disk at a time, taking only the top disk from a peg, and never placing a larger disk on top of a smaller one. You can create simpler or more complex versions of the game by varying the number of disks.

What they found is that LRMs are actually worse than LLMs at the simplest versions of the puzzle, are slightly but not dramatically better when more disks are added - then fail completely with more than eight disks. Simple problems (N=1-3) show early accuracy declining over time (overthinking), moderate problems (N=4-7) show slight improvement in accuracy with continued reasoning, and complex problems (N≥8) exhibit consistently near-zero accuracy, indicating complete reasoning failure, meaning that the model fails to generate any correct solutions within its thinking. In fact, they demonstrated that LRMs fail even when you give them the algorithm needed to solve it!

They say that these findings cast doubt on claims being made about the latest AI models. "These insights challenge prevailing assumptions about LRM capabilities [...] Our findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds."

New York University professor emeritus of psychology and neural science Gary Marcus - who has long argued that LLMs are incapable of reasoning - said that it shows that we need to move beyond the hope that making more and more capable LLMs will eventually result in intelligence.
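The puzzle at the center of the paper really is trivial to solve by algorithm. As a point of reference (a minimal sketch of the textbook recursive solution, not code from the paper), the optimal move list for n disks takes only a few lines of Python and always has 2^n - 1 moves:

```python
# Minimal sketch (not from the Apple paper): the textbook recursive Tower of
# Hanoi solution, which emits an optimal move list for any number of disks.
def hanoi(n, source="A", target="C", spare="B"):
    """Return the optimal move sequence as (disk, from_peg, to_peg) tuples."""
    if n == 0:
        return []
    return (hanoi(n - 1, source, spare, target)     # park n-1 disks on the spare peg
            + [(n, source, target)]                 # move the largest disk
            + hanoi(n - 1, spare, target, source))  # restack the n-1 disks on top

if __name__ == "__main__":
    for n in (3, 8, 10):
        print(f"{n} disks -> {len(hanoi(n))} moves")  # always 2**n - 1
```

Eight disks, the point at which the paper reports complete failure, needs only 255 moves.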
[2]
Apple Research Questions AI Reasoning Models Just Days Before WWDC
A newly published Apple Machine Learning Research study has challenged the prevailing narrative around AI "reasoning" large language models like OpenAI's o1 and Claude's thinking variants, revealing fundamental limitations that suggest these systems aren't truly reasoning at all.

For the study, rather than using standard math benchmarks that are prone to data contamination, Apple researchers designed controllable puzzle environments including Tower of Hanoi and River Crossing. This allowed a precise analysis of both the final answers and the internal reasoning traces across varying complexity levels, according to the researchers.

The results are striking, to say the least. All tested reasoning models - including o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet - experienced complete accuracy collapse beyond certain complexity thresholds, dropping to zero success rates despite having adequate computational resources. Counterintuitively, the models actually reduce their thinking effort as problems become more complex, suggesting fundamental scaling limitations rather than resource constraints.

Perhaps most damning, even when researchers provided complete solution algorithms, the models still failed at the same complexity points. Researchers say this indicates the limitation isn't in problem-solving strategy, but in basic logical step execution. Models also showed puzzling inconsistencies - succeeding on problems requiring 100+ moves while failing on simpler puzzles needing only 11 moves.

The research highlights three distinct performance regimes: standard models surprisingly outperform reasoning models at low complexity, reasoning models show advantages at medium complexity, and both approaches fail completely at high complexity. The researchers' analysis of reasoning traces showed inefficient "overthinking" patterns, where models found correct solutions early but wasted computational budget exploring incorrect alternatives.

The take-home of Apple's findings is that current "reasoning" models rely on sophisticated pattern matching rather than genuine reasoning capabilities. It suggests that LLMs don't scale reasoning like humans do, overthinking easy problems and thinking less for harder ones. The timing of the publication is notable, having emerged just days before WWDC 2025, where Apple is expected to limit its focus on AI in favor of new software designs and features, according to Bloomberg.
[3]
Apple Says Claude, DeepSeek-R1, and o3-mini Can't Really Reason | AIM
The researchers argue that traditional benchmarks, like math and coding tests, are flawed due to "data contamination" and fail to reveal how these models actually "think".

AI critic Gary Marcus is smiling again, thanks to Apple. In a new paper titled The Illusion of Thinking, researchers from the Cupertino-based company argue that even the most advanced AI models, including the so-called large reasoning models (LRMs), don't actually think. Instead, they simulate reasoning without truly understanding or solving complex problems.

The paper, released just ahead of Apple's Worldwide Developer Conference, tested leading AI models, including OpenAI's o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, using specially designed algorithmic puzzle environments rather than standard benchmarks. "We show that state-of-the-art LRMs still fail to develop generalisable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments," the paper noted.

Interestingly, one of the authors of the paper is Samy Bengio, the brother of Turing Award winner Yoshua Bengio. Yoshua recently launched LawZero, a Canada-based nonprofit AI safety lab working on building systems that prioritise truthfulness, safety, and ethical behaviour over commercial interests. The lab has secured around $30 million in initial funding from prominent backers, including former Google CEO Eric Schmidt's philanthropic organisation, Skype co-founder Jaan Tallinn, Open Philanthropy, and the Future of Life Institute.

Backing the paper's claims, Marcus could not hold back his excitement. "AI is not hitting a wall. But LLMs probably are (or at least a point of diminishing returns). We need new approaches, and to diversify which roads are being actively explored." "I don't think LLMs are a good way to get there (AGI). They might be part of the answer, but I don't think they are the whole answer," Marcus said in a previous interaction with AIM, stressing that LLMs are not "useless". He also expressed optimism about AGI, describing it as a machine capable of approaching new problems with the flexibility and resourcefulness of a smart human being. "I think we'll see it someday," he further said.

Taking a more balanced view, Ethan Mollick, professor at The Wharton School, said in a post on X, "I think the Apple paper on the limits of reasoning models in particular tests is useful & important, but the "LLMs are hitting a wall" narrative on X around it feels premature at best. Reminds me of the buzz over model collapse -- limitations that were overcome quickly in practice." He added that the current approach to reasoning likely has real limitations for a variety of reasons. However, the reasoning approaches themselves were made public less than a year ago. "There are just a lot of approaches that might overcome these issues. Or they may not. It's just very early."

Hemanth Mohapatra, partner at Lightspeed India, said that the recent Apple paper showing reasoning struggles with complex problems confirms what many experts, like Yann LeCun, have long sensed. He acknowledged that while a new direction is necessary, current AI capabilities still promise significant productivity gains. "We do need a different hill to climb, but that doesn't mean existing capabilities won't have huge impact on productivity," he said.

Meanwhile, Subbarao Kambhampati, professor at Arizona State University, who has been pretty vocal about LLMs' inability to reason and think, quipped that another advantage of being a university researcher in AI is, "You don't have to deal with either the amplification or the backlash as a surrogate for 'The Company'. Your research is just your research, fwiw."

Instead of relying on familiar benchmarks, Apple's team used controlled puzzle environments, such as variants of the Tower of Hanoi, to precisely manipulate problem complexity and observe how models generate step-by-step "reasoning traces". This allowed them to see not just the final answer, but the process the model used to get there.

The paper found that for simpler problems, non-reasoning models often outperformed more advanced LRMs, which tended to "overthink" and miss the correct answer. As the difficulty level rose to moderate, the reasoning models showed their strength, successfully following more intricate logical steps. However, when faced with truly complex puzzles, all models, regardless of their architecture, struggled and ultimately failed. Rather than putting in more effort, the AI responses grew shorter and less thoughtful, as if the models were giving up.

While large language models continue to struggle with complex reasoning, that doesn't make them useless. As Abacus.AI CEO Bindu Reddy pointed out on X, many people are misinterpreting the paper as proof that LLMs don't work. "All this paper is saying is LLMs can't solve arbitrarily hard problems yet," she said, adding that they're already handling tasks beyond the capabilities of most humans.

The researchers suggest that what appears to be reasoning is often just the retrieval and adaptation of memorised solution templates from training data, not genuine logical deduction. When confronted with unfamiliar and highly complex problems, the models' reasoning abilities collapse almost immediately, revealing such "thinking" to be an illusion. The study makes it clear that current large language models are still far from being true general-purpose reasoners. Their ability to handle reasoning tasks does not extend beyond a certain level of complexity, and even targeted efforts to train them with the correct algorithms result in only minor improvements.

Andrew White, co-founder of FutureHouse, questioned Apple's approach, saying that its AI researchers seem to have adopted an "anti-LLM cynic ethos" by repeatedly publishing papers that argue reasoning LLMs are fundamentally limited and lack generalisation ability. He pointed out the irony, saying Apple has "the worst AI products" like Siri and Apple Intelligence, and admitted he has no idea what their actual strategy is.

Apple's research serves as a cautionary message for AI developers and users alike. While today's chatbots and reasoning models appear impressive, their core abilities remain limited. As the paper puts it, "despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds."

"We need models that can represent and manipulate abstract structures, not just predict tokens. Hybrid systems that combine LLMs with symbolic logic, memory modules, or algorithmic planners are showing early promise. These aren't just add-ons -- they reshape how the system thinks," said Pradeep Sanyal, AI and data leader at a global tech consulting firm, in a LinkedIn post. He further added that combining neural and symbolic parts isn't without drawbacks. It introduces added complexity around coordination, latency, and debugging. But the improvements in precision and transparency make it a direction worth exploring.
[4]
AI models still far from AGI-level reasoning: Apple researchers
Current "thinking" AI models still can't reason to a level that would be expected from humanlike artificial general intelligence, the researchers found. The race to develop artificial general intelligence (AGI) still has a long way to run, according to Apple researchers who found that leading AI models still have trouble reasoning. Recent updates to leading AI large language models (LLMs) such as OpenAI's ChatGPT and Anthropic's Claude have included large reasoning models (LRMs), but their fundamental capabilities, scaling properties, and limitations "remain insufficiently understood," said the Apple researchers in a June paper called "The Illusion of Thinking." They noted that current evaluations primarily focus on established mathematical and coding benchmarks, "emphasizing final answer accuracy." However, this evaluation does not provide insights into the reasoning capabilities of the AI models, they said. The research contrasts with an expectation that artificial general intelligence is just a few years away. The researchers devised different puzzle games to test "thinking" and "non-thinking" variants of Claude Sonnet, OpenAI's o3-mini and o1, and DeepSeek-R1 and V3 chatbots beyond the standard mathematical benchmarks. They discovered that "frontier LRMs face a complete accuracy collapse beyond certain complexities," don't generalize reasoning effectively, and their edge disappears with rising complexity, contrary to expectations for AGI capabilities. "We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles." They found inconsistent and shallow reasoning with the models and also observed overthinking, with AI chatbots generating correct answers early and then wandering into incorrect reasoning. Related: AI solidifying role in Web3, challenging DeFi and gaming: DappRadar The researchers concluded that LRMs mimic reasoning patterns without truly internalizing or generalizing them, which falls short of AGI-level reasoning. "These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning." AGI is the holy grail of AI development, a state where the machine can think and reason like a human and is on a par with human intelligence. In January, OpenAI CEO Sam Altman said the firm was closer to building AGI than ever before. "We are now confident we know how to build AGI as we have traditionally understood it," he said at the time. In November, Anthropic CEO Dario Amodei said that AGI would exceed human capabilities in the next year or two. "If you just eyeball the rate at which these capabilities are increasing, it does make you think that we'll get there by 2026 or 2027," he said.
[5]
Apple Researchers Find 'Accuracy Collapse' Problem in AI Reasoning Models
Claude 3.7 Sonnet and DeepSeek V3/R1 were chosen for this experiment

Apple published a research paper on Saturday in which researchers examine the strengths and weaknesses of recently released reasoning models. Also known as large reasoning models (LRMs), these are the models that "think" by utilising additional compute to solve complex problems. However, the paper found that even the most powerful models struggle with a complexity issue. Researchers said that when a problem is highly complex, the models experience a total collapse and give up on the problem instead of using more compute, which is something they're trained to do.

In a paper titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," published on Apple's website, the researchers claim both LRMs and large language models (LLMs) without thinking capability behave differently when faced with three regimes of complexity: low complexity tasks, medium complexity tasks, and high complexity tasks.

To test how LLMs and LRMs function when dealing with a wide range of complexities, the researchers decided to use several puzzles that can have an increasing level of difficulty. One puzzle in particular was the Tower of Hanoi. The Tower of Hanoi is a mathematical puzzle with three pegs and several disks. Disks are arranged in a decreasing order of size to create a pyramid-like shape. The objective of the puzzle is to shift the disks from the leftmost peg to the rightmost peg, while moving one disk at a time. There is a catch -- at no time should a larger disk be placed on top of a smaller disk. It is not a very difficult puzzle, and it is often targeted at children between the ages of six and 15.

Apple researchers chose two reasoning models and their non-reasoning counterparts for this experiment. The LLMs chosen were Claude 3.7 Sonnet and DeepSeek-V3, while the LRMs were Claude 3.7 Sonnet with Thinking and DeepSeek-R1. The thinking budget was capped at 64,000 tokens for each. The aim of the experiment was not just to check the final accuracy, but also the accuracy of the logic used in choosing the steps to solve the puzzle. In the low complexity task, up to three disks were used, whereas for the medium complexity task, the number of disks was kept between four and 10. Finally, in the high complexity task, there were between 11 and 20 disks.

The researchers noted that both LLMs and LRMs displayed equal aptitude in solving the low complexity task. When the difficulty was increased, reasoning models were able to solve the puzzle more accurately, given the extra budget of compute. However, when the tasks reached the high complexity zone, it was found that both models showed a complete collapse of reasoning. The same experiment was also said to have been repeated with more models and more puzzles, such as Checker Jumping, River Crossing, and Blocks World.

Apple's research paper highlights the concerns that several others in the artificial intelligence (AI) space have already expressed. While reasoning models can generalise within the distribution of their training data, whenever a problem falls outside it, the models struggle to "think," and either try to take shortcuts in finding the solution, or completely give up and collapse. "Current evaluations primarily focus on established mathematical and coding benchmarks, emphasising final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces' structure and quality," the company said in a post.
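For a sense of scale (our own back-of-the-envelope arithmetic; only the 64,000-token budget and the disk counts come from the article above), the optimal Tower of Hanoi solution for n disks takes 2^n - 1 moves, so the high complexity band quickly dwarfs the thinking budget even under a generous assumption about how few tokens each written move costs:

```python
# Back-of-the-envelope check: optimal Tower of Hanoi solutions take 2**n - 1
# moves.  TOKENS_PER_MOVE is an illustrative assumption, not a number from the
# paper; the 64,000-token thinking budget is the cap described in the article.
THINKING_BUDGET = 64_000
TOKENS_PER_MOVE = 5  # assumed cost of writing out a single move

for n in (3, 10, 11, 15, 20):
    moves = 2**n - 1
    est_tokens = moves * TOKENS_PER_MOVE
    verdict = "fits" if est_tokens <= THINKING_BUDGET else "exceeds budget"
    print(f"{n:2d} disks: {moves:>9,} moves, ~{est_tokens:>9,} tokens ({verdict})")
```

At that assumed rate, merely writing out the answer for 15 disks would exhaust the budget before any searching or self-correction happened.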
[6]
Apple research claims popular AI models fail at hard reasoning: Why does it matter?
Synthetic benchmarks, though valuable, overstate AI limitations by ignoring real-world tools

Over the weekend, Apple released new research that accuses most advanced generative AI models from the likes of OpenAI, Google and Anthropic of failing to handle tough logical reasoning problems. Apple's researchers claim to prove how most large reasoning models (or LRMs) simply "give up" when tasked with hard puzzle-solving tasks, thereby exposing a major pitfall in GenAI's reasoning capabilities - as they exist for the most part in the LLM-based chatbots we've all gotten used to over the past couple of years.

In their recent paper, "The Illusion of Thinking," Apple's researchers pull back the curtain on how large reasoning models (LRMs) completely mess up on difficult reasoning tasks. Basing their paper on GenAI's ability to solve certain algorithmic puzzles, Apple's researchers paint a stark picture - that these models start strong on easy and medium-difficulty problems, then simply "give up" once complexity crosses a threshold. But before we declare AI reasoning officially broken, it's worth asking - how much of this collapse reflects reality, and how much is an artifact of the puzzles themselves?

Apple's critique of benchmarks like GSM8K and MATH starts with a valid point. The paper says that too often models memorize leaked test data, inflating our sense of their reasoning prowess. To combat this, Apple devised four classic algorithmic puzzles - Tower of Hanoi, River Crossing, Blocks World, and Checker Jumping - each scalable in precise steps while holding the basic logic constant. This lets them track not just whether a model gets the right answer, but also the length and structure of its "chain-of-thought" token traces.

By testing top-tier LRMs - OpenAI's o-series, Anthropic's Claude, Google's Gemini 2.5, among others - Apple researchers saw a consistent pattern. First, in low-complexity puzzles, vanilla next-token models sometimes outdo "reasoning" variants. Second, in medium-complexity puzzles, chain-of-thought prompting gives LRMs an edge. However, once you enter high-complexity problem solving, accuracy crashes to near zero, no matter how many tokens you throw at the problem.

The most alarming result is what Apple calls the "accuracy cliff." Once a puzzle's compositional steps exceed a hidden breakpoint - which is unique to each AI model - success rates plummet instantly. Equally telling is what happens to the models' token-level reasoning traces. Rather than lengthening their chains to tackle harder steps, LRMs start shortening them - a clear "giving up" heuristic, as Apple frames it. This echoes earlier stress-test research on SAT solvers and arithmetic models, where sharp performance drops past a certain complexity have been reported - especially on mathematical problems. And independent code-generation studies have come to the same conclusion in the past: when faced with too many lines of logic, GenAI models produce plausible but incomplete code, leaving out important details in the name of concision and becoming less accurate as a result.

Apple's puzzle approach effectively demonstrates that chain-of-thought prompting has clear scaling limitations, as performance gains diminish rapidly beyond moderate difficulty levels. Their use of focused synthetic tasks also ensures benchmark integrity by avoiding the contamination issues that plague popular evaluation suites, where models may have memorized training data. Perhaps most significantly, their token-trace analysis reveals that these models don't simply process slowly when facing complex problems - they actively reduce their own reasoning pathways when they detect they're heading toward a dead end, suggesting a more sophisticated but potentially limiting form of self-regulation in AI reasoning processes.

Here's where the Apple paper risks overreach. Because whether you like it or not, algorithmic puzzles live in a vacuum, stripped of the rich context, domain heuristics, and external tools (think calculators, retrieval systems, or symbolic engines) that real-world tasks allow. Few of us solve problems solely by chaining logic in our heads these days - we Google, we scribble, we offload heavy lifting to spreadsheets or math libraries. Hybrid architectures - think of retrieval-augmented (RAG) models - can dynamically fetch facts or calculate precisely, bridging the gap between general and more focused reasoning and thereby plugging the holes Apple's puzzle evaluation exposes. By focusing narrowly on standalone LRMs, the Apple paper sidesteps these more robust systems, which are quickly becoming the norm.

Apple's experiments reportedly use off-the-shelf models with minimal prompt optimization. But anyone who's ever used ChatGPT or the Gemini chatbot knows that careful prompt engineering or targeted follow-up commands can push output quality considerably higher, and that's true in benchmarks as well. In other words, the reasoning collapse Apple is alluding to might shift further up the complexity curve, rather than vanish entirely. Also, interpreting shorter reasoning chains as outright failure can be problematic. Models often prune redundant intermediate steps when they suspect a pattern, aiming for a concise but correct response. In such a scenario, token economy isn't necessarily a cry of defeat - it can also be a sign of increasing efficiency. We need finer metrics - perhaps measuring whether the pruned steps eliminate critical logical nodes or merely trim fluff - before diagnosing an exhaustion syndrome.

All said and done, Apple's "The Illusion of Thinking" is a welcome reality check. It reminds us that shiny demos on sanitized benchmarks can lull us into overconfidence. The paper's controlled puzzles unveil genuine cracks in standalone LLM reasoning, and its trace analyses offer a compelling new window into model behavior. But it's also important to note that these puzzles are not the final word on AI's reasoning future. Real-world intelligence rarely reduces to pure logic chains, as retrieval skills, tools at hand, and human ingenuity all play a part when we're trying to get something done. If we want AI that truly "thinks," we must broaden our evaluation horizons, test hybrid systems in pragmatic scenarios, and refine our metrics to capture both depth and efficiency of reasoning.
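To make the tool-offloading point concrete, here is a minimal sketch (our illustration, not anything from Apple's paper or a particular product) in which the model is asked only to translate a question into an arithmetic expression, and ordinary code evaluates it exactly; `call_llm` is a hypothetical stand-in for whatever chat-completion client is in use:

```python
# Hypothetical tool-offloading loop: the model (stubbed here) emits an
# expression instead of computing digit by digit in its own tokens; the exact
# arithmetic is done by a small, deterministic evaluator.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Pow: operator.pow}

def safe_eval(expr: str) -> int:
    """Exactly evaluate a +, -, *, ** integer expression (no names, no calls)."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call: a real system would return the model's
    # translation of the question into an expression.
    return "2**20 - 1"

question = "How many moves does a 20-disk Tower of Hanoi need?"
print(safe_eval(call_llm(question)))  # 1048575, computed exactly rather than predicted
```

The split of labour mirrors the author's point: the language model handles the fuzzy translation from question to expression, while the exact multi-step computation is delegated to a deterministic tool.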
Apple researchers dispute claims that AI models are capable of reasoning, demonstrating fundamental limitations in current Large Reasoning Models (LRMs) through controlled puzzle experiments.
In a groundbreaking study titled "The Illusion of Thinking," Apple researchers have cast doubt on the claims that current AI models, including Large Reasoning Models (LRMs), are capable of genuine reasoning [1]. The research, published just days before Apple's Worldwide Developer Conference (WWDC), challenges prevailing assumptions about AI capabilities and suggests fundamental limitations in current approaches to artificial general intelligence (AGI) [2].
Apple's research team, which includes Samy Bengio, brother of Turing Award winner Yoshua Bengio, employed a novel approach to evaluate AI reasoning capabilities [3]. Instead of relying on standard mathematical benchmarks, which are prone to data contamination, the researchers designed controllable puzzle environments, such as variations of the Tower of Hanoi [1]. This method allowed for precise analysis of both final answers and internal reasoning traces across varying complexity levels.
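As a rough illustration of what such a controllable environment makes possible (a hypothetical sketch, not Apple's actual evaluation harness), every move a model proposes can be checked against the Tower of Hanoi rules, so partial reasoning traces can be judged rather than only final answers:

```python
# Hypothetical sketch of a puzzle checker: validate a proposed move sequence
# against the Tower of Hanoi rules instead of only inspecting the final answer.
# In a real harness, the move list would come from the model under test.
def is_valid_solution(n_disks: int, moves: list[tuple[str, str]]) -> bool:
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}  # peg A holds all disks
    for src, dst in moves:
        if not pegs[src]:                        # tried to move from an empty peg
            return False
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:   # larger disk placed on a smaller one
            return False
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n_disks, 0, -1))  # every disk ends on the target peg

# The optimal 3-disk solution, written out by hand, passes the check.
demo_moves = [("A", "C"), ("A", "B"), ("C", "B"),
              ("A", "C"), ("B", "A"), ("B", "C"), ("A", "C")]
print(is_valid_solution(3, demo_moves))  # True
```

Sweeping the disk count upward in a loop like this, and recording where a model's proposed moves first break a rule, is the kind of complexity-controlled analysis the summary describes.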
The study revealed striking results across different complexity regimes:
Low Complexity Tasks: Surprisingly, standard models outperformed reasoning models, with LRMs showing a tendency to "overthink" simple problems [2].
Medium Complexity Tasks: Reasoning models demonstrated advantages, successfully following more intricate logical steps [3].
High Complexity Tasks: Both approaches experienced a complete accuracy collapse, dropping to zero success rates despite adequate computational resources [2].
The research highlights several critical insights: the models reduce their reasoning effort as problems grow more complex, fail at the same complexity points even when handed explicit solution algorithms, and "overthink" simple problems, finding correct answers early before wandering into incorrect alternatives [2]. What looks like reasoning, the researchers suggest, is often the retrieval and adaptation of memorised solution templates rather than genuine logical deduction [3].
The findings have sparked discussions within the AI community: Gary Marcus sees them as evidence that ever-larger LLMs will not, on their own, lead to intelligence [1]; Ethan Mollick called the paper useful but warned that the "LLMs are hitting a wall" narrative around it is premature [3]; and Andrew White questioned Apple's framing, accusing its researchers of an "anti-LLM cynic ethos" [3].
The study's results suggest that the path to AGI may be longer and more complex than some have predicted. OpenAI CEO Sam Altman and Anthropic CEO Dario Amodei have previously expressed optimism about rapid progress towards AGI, with timelines as short as 2026 or 2027 [4]. However, Apple's research indicates that current approaches may be encountering fundamental barriers to generalizable reasoning [4].
While the research highlights significant limitations in current AI models, it's important to note that these findings don't negate the usefulness of large language models in various applications. As Bindu Reddy, CEO of Abacus.AI, pointed out, LLMs are already handling tasks beyond the capabilities of most humans [3]. The study does, however, underscore the need for new approaches and diversification in AI research to overcome the current limitations in reasoning and problem-solving capabilities.