2 Sources
[1]
Why does AI fail at basic multiplication?
New research reveals why even state-of-the-art large language models stumble on seemingly easy tasks -- and what it takes to fix it.

These days, large language models can handle increasingly complex tasks, writing intricate code and engaging in sophisticated reasoning. But when it comes to four-digit multiplication, a task taught in elementary school, even state-of-the-art systems fail. Why?

A new paper by University of Chicago computer science PhD student Xiaoyan Bai and Chenhao Tan, faculty codirector of the Data Science Institute's Novel Intelligence Research Initiative, finds answers by reverse-engineering both failure and success. They worked with collaborators from MIT, Harvard University, the University of Waterloo and Google DeepMind to probe AI's "jagged frontier" -- a term for its capacity to excel at complex reasoning yet stumble on seemingly simple tasks.

As you may remember (or have forgotten), multiplying larger numbers requires carrying over digits and mentally "holding on" to partial products so you can add them up to get your final total. Processes that require storing information for later use in this way are called "long-range dependencies" (a short worked example appears a few paragraphs below). Standard large language models work by learning to recognize patterns in the data they're trained on. But the more complex a problem gets, the less likely a model is to have seen it specifically. So how do you teach a model not just to memorize answers but to learn a process?

Models are often taught new tasks with a process known as standard fine-tuning, and performance is typically pushed further by scaling up the training data or adding more steps, or "layers." But even when the research team tested models ranging from two layers all the way up to 12, they all achieved less than 1% accuracy when multiplying two four-digit numbers. The standard approaches were clearly failing, and the researchers wanted to understand why.

They found that under the standard approach, models converge on a "local optimum," or what they identify as the best solution in each dataset. But tasks like multi-digit multiplication require a model to remember earlier computations while producing later digits. Without an architecture that can store and retrieve intermediate information, a model gets stuck, unable to move beyond that local optimum -- no matter how long it trains or how large it scales.

Next, the researchers identified a model trained using a different method: Implicit Chain of Thought (ICoT). Where standard fine-tuning achieved less than 1% accuracy, the ICoT model achieved 100% accuracy. To understand what this approach was doing differently, the team took both models apart to uncover some fundamental insights.

First, they saw that the ICoT model learns to remember what matters. Unlike the standard fine-tuning model, the ICoT model learned to track those long-range dependencies, or the information it gradually put together to solve a problem. The team verified this by testing whether they could decode intermediate values, such as running sums, from the models' internal states. In the ICoT model, they could -- but in the standard model, they couldn't. The ICoT method gradually removes intermediate reasoning steps during training, in a sense forcing the model to internalize the reasoning process in its hidden states rather than relying on explicit step-by-step tokens.
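To make the bookkeeping concrete, here is a short Python sketch (ours, not from the paper) that multiplies two four-digit numbers the way the digit-by-digit procedure does, printing the partial products and the running sum that must be held onto until the end: exactly the kind of long-range dependency described above.

```python
def long_multiply(a: int, b: int) -> int:
    """Multiply digit by digit, tracking the partial products and the running
    sum that must be carried forward -- the 'long-range dependencies' at issue."""
    a_digits = [int(d) for d in str(a)][::-1]   # least-significant digit first
    b_digits = [int(d) for d in str(b)][::-1]
    running_sum = 0
    for i, db in enumerate(b_digits):
        # Partial product of `a` with one digit of `b`, shifted into place.
        partial = sum(da * db * 10 ** (i + j) for j, da in enumerate(a_digits))
        running_sum += partial                  # must be remembered until the end
        print(f"partial {db} x {a} -> {partial:>10,}, running sum = {running_sum:,}")
    return running_sum


assert long_multiply(4732, 8159) == 4732 * 8159   # 38,608,388
```

Each later digit of the final answer depends on partial products computed much earlier, which is why a model with nowhere to store those intermediate values stalls.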
Next, they saw that the ICoT model organizes its attention into distinct pathways across time. Think of it like a well-organized filing system: in early layers, the model computes products of digit pairs and stores them at specific locations. In later layers, it retrieves exactly the values it needs to calculate each digit of the final answer. The result is an efficient internal structure for carrying out multiplication, one that never emerges in the standard model.

Finally, and perhaps most remarkably, the researchers found that the ICoT model internally represents these operations using elegant structures. Instead of treating digits as symbols alone, the model encodes them as wave-like patterns known as Fourier bases and organizes its arithmetic in a visual, spatial way. When multiplying digit pairs, the model uses a natural geometric operation called a Minkowski sum -- something the researchers didn't program, but that emerged naturally during training in the ICoT model. It's as if the successful model derived its own efficient mathematical language for arithmetic.

The researchers reasoned that if the standard fine-tuning models failed because they lacked the right built-in guidance, then providing the right training signal should fix it. To test this, the team introduced a simple solution: an added training objective that teaches the model to track running sums at each step, allowing it to carry intermediate values and partial products forward (a rough sketch of one way to set up such an objective appears at the end of this article). Making this one addition to the two-layer model that had completely failed under standard training did the trick. The result: 99% accuracy without explicit chain-of-thought supervision. When the researchers examined the model's attention patterns, they found it had learned mechanisms similar to ICoT's -- structures that store and retrieve partial products as needed. The model also developed additional strategies, including a way to track multiple digit pairs at the same time.

While multiplication might seem like a narrow task, the findings illuminate fundamental aspects of how large language models learn and "think." The long-range dependency problem isn't unique to arithmetic -- it appears throughout language modeling and other sequential tasks. The UChicago team's approach raises foundational questions about the distinctions between memorization and learning, and about which architectural constraints help or hinder models' performance.

"As AI is increasingly integrated into critical decision-making, it's essential to understand its unique ways of learning and thinking," says Tan. "Our research is trying to chart that terrain."

This paper's key contribution: architectural insights and training techniques can overcome obstacles that scaling alone cannot address. The right built-in guidance, not just more parameters or data, is key to pushing AI capabilities forward. While the solution for the multiplication issue is task-specific, the researchers anticipate future work will develop more general approaches to improve learning on tasks that require models to keep track of information across many steps.
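The article doesn't give the exact form of the added running-sum objective, so the PyTorch snippet below is only a minimal sketch of the general idea under our own assumptions: alongside the usual next-token loss, an extra linear head is asked to read the current running sum out of each hidden state. The class name `AuxRunningSumLoss`, the tensor shapes, and the choice of a mean-squared-error regression head are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxRunningSumLoss(nn.Module):
    """Next-token loss plus an auxiliary head that must predict the running sum
    held at each step. A sketch of the kind of added objective described in the
    article, not the paper's actual formulation."""

    def __init__(self, d_model: int, vocab_size: int, aux_weight: float = 0.5):
        super().__init__()
        self.lm_head = nn.Linear(d_model, vocab_size)  # predicts the next token
        self.sum_head = nn.Linear(d_model, 1)          # reads out the running sum
        self.aux_weight = aux_weight

    def forward(self, hidden, next_tokens, running_sums):
        # hidden:       [batch, seq, d_model] hidden states from any backbone
        # next_tokens:  [batch, seq]          next-token targets
        # running_sums: [batch, seq]          true (scaled) running sum at each step
        lm_loss = F.cross_entropy(self.lm_head(hidden).flatten(0, 1),
                                  next_tokens.flatten())
        aux_loss = F.mse_loss(self.sum_head(hidden).squeeze(-1), running_sums)
        return lm_loss + self.aux_weight * aux_loss

# Toy call with random tensors, just to show the shapes involved.
batch, seq, d_model, vocab = 8, 16, 64, 14
loss = AuxRunningSumLoss(d_model, vocab)(
    torch.randn(batch, seq, d_model),
    torch.randint(0, vocab, (batch, seq)),
    torch.randn(batch, seq),
)
loss.backward()
```

According to the article, supervising one extra signal of this kind was enough to take a two-layer model from near-zero to 99% accuracy.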
[2]
Standard AI models fail simple math without specialized training
Large language models struggle with multi-digit multiplication unless they are given specialized training, despite their ability to handle complex coding and reasoning tasks, according to a recent study. Research published on the arXiv preprint server by the University of Chicago's Xiaoyan Bai and Chenhao Tan, along with collaborators from MIT, Harvard University, the University of Waterloo, and Google DeepMind, identified the reasons for this limitation and found solutions.

Standard large language models achieved less than 1% accuracy when multiplying two four-digit numbers, even with layer counts increased up to 12. These models converged on a "local optimum," failing to store and retrieve the intermediate computations necessary for multi-digit multiplication, which are categorized as long-range dependencies.

Conversely, a model trained with the Implicit Chain of Thought (ICoT) method achieved 100% accuracy. The ICoT model demonstrated an ability to track long-range dependencies and internalize reasoning processes by gradually removing intermediate reasoning steps during training (a toy sketch of this idea appears below). The research team decoded intermediate values, such as running sums, from the ICoT model's internal states, which was not possible with the standard fine-tuning model.

The ICoT model organized its attention into distinct pathways, computing products of digit pairs in early layers and storing them in specific locations for retrieval in later layers. This created an efficient internal structure for multiplication. The study also found that the ICoT model represented operations using elegant structures, encoding digits as wave-like patterns (Fourier bases) and organizing arithmetic spatially. During multiplication of digit pairs, the model naturally utilized a geometric operation called a Minkowski sum, which was not explicitly programmed by the researchers.

Researchers achieved 99% accuracy in a two-layer model by introducing a modified training objective that taught the model to track running sums at each step, thereby carrying intermediate values and partial products forward. This addition enabled the model to develop mechanisms similar to ICoT's, including storing and retrieving partial products and tracking multiple digit pairs simultaneously.

Chenhao Tan said, "Our research is trying to chart that terrain." The study highlights that architectural insights and training techniques can overcome obstacles that scaling alone cannot address, emphasizing the importance of built-in guidance in advancing AI capabilities. The findings illuminate fundamental aspects of how large language models learn and "think," with the long-range dependency problem extending beyond arithmetic to other sequential tasks in language modeling.
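The summaries describe ICoT only as "gradually removing intermediate reasoning steps during training." The toy Python function below sketches one way such a curriculum could be scheduled; the function name, the linear schedule, and the worked decomposition are our own illustrative assumptions, not the method's actual recipe.

```python
def icot_style_example(question: str, steps: list[str], answer: str,
                       epoch: int, total_epochs: int) -> str:
    """Build one training sequence, keeping fewer intermediate reasoning steps as
    training progresses, so the model must internalize them in its hidden states.
    A toy sketch of the idea, not the ICoT method's actual schedule."""
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    drop = int(len(steps) * progress)          # how many steps to remove this epoch
    kept = steps[drop:]                        # remove earlier steps first
    middle = " ; ".join(kept) + " ; " if kept else ""
    return f"{question} = {middle}{answer}"

# Partial products of 4732 x 8159, written out as explicit reasoning steps.
steps = ["4732*9=42588", "4732*50=236600", "4732*100=473200", "4732*8000=37856000"]
for epoch in range(4):
    print(icot_style_example("4732*8159", steps, "38608388", epoch, total_epochs=4))
```

By the final epoch the sequence contains only the question and the answer, so any intermediate bookkeeping has to live in the model's hidden states rather than in the visible tokens.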
Despite handling complex coding tasks, large language models fail at four-digit multiplication, achieving less than 1% accuracy with standard training. Researchers from University of Chicago, MIT, Harvard, and Google DeepMind discovered the culprit: models can't store and retrieve intermediate computations. But a specialized Implicit Chain of Thought method achieved 100% accuracy by teaching models to internalize reasoning processes.
Large language models have reached impressive heights in AI reasoning capabilities, writing sophisticated code and tackling complex problems. Yet when confronted with four-digit multiplication, a task elementary school students master, even state-of-the-art systems collapse. Research from University of Chicago PhD student Xiaoyan Bai and faculty member Chenhao Tan, working alongside collaborators from MIT, Harvard University, University of Waterloo, and Google DeepMind, has uncovered why this paradox exists and how to solve it [1][2].
Source: Futurity
The phenomenon reflects what researchers call AI's "jagged frontier": the capacity to excel at sophisticated reasoning while stumbling on seemingly simple tasks. Standard large language models achieved less than 1% accuracy when multiplying two four-digit numbers, even when tested with architectures ranging from two layers all the way up to 12 layers [1]. Scaling up training data or adding more computational layers didn't help. The standard approaches were clearly failing, and the research team wanted to understand exactly why.

Multiplying larger numbers requires carrying over digits and mentally holding onto partial products before adding them to reach a final total. These processes that require storing information for later use are called long-range dependencies [1]. Standard large language models work by recognizing patterns in training data, but as problems grow more complex, the likelihood of a model having seen that specific problem diminishes. The challenge becomes teaching models to learn a process rather than simply memorize answers.

Researchers discovered that under standard fine-tuning approaches, models converge on a "local optimum": what they identify as the best solution within each dataset. However, tasks like multi-digit multiplication require models to store and retrieve intermediate computations while producing later digits [2]. Without an architecture capable of handling this information flow, models get stuck, unable to move beyond that local optimum regardless of training duration or scale.

The research team identified a model trained using Implicit Chain of Thought (ICoT), which achieved 100% accuracy compared to the dismal performance of standard methods [1]. By reverse-engineering both successful and failed approaches, they uncovered fundamental insights into how models can truly learn mathematical reasoning.

The ICoT model learned to track long-range dependencies, the information it gradually assembles to solve problems. The team verified this by testing whether they could decode intermediate values, such as running sums, from the models' internal states. In the ICoT model, they could extract these values, but in the standard model, they couldn't [2]. The ICoT method gradually removes intermediate reasoning steps during training, forcing the model to internalize the reasoning process in its hidden states rather than relying on explicit step-by-step tokens.
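The article doesn't spell out the decoding procedure, but checks of this kind generally fit a simple (often linear) readout from hidden states to the quantity of interest and measure held-out accuracy. The numpy sketch below runs that recipe on synthetic stand-in data rather than real model activations, so the fabricated "hidden states" and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 64, 500

# Stand-in data: pretend each hidden state encodes the running sum along one
# fixed direction, plus noise. Real probing would use actual model activations.
direction = rng.normal(size=d_model)
running_sums = rng.integers(0, 10_000, size=n).astype(float)
states = np.outer(running_sums, direction) + rng.normal(scale=0.1, size=(n, d_model))

# Fit a linear probe on one split and test how well it decodes the held-out sums.
train, test = slice(0, 400), slice(400, None)
w, *_ = np.linalg.lstsq(states[train], running_sums[train], rcond=None)
pred = states[test] @ w
ss_res = np.sum((pred - running_sums[test]) ** 2)
ss_tot = np.sum((running_sums[test] - running_sums[test].mean()) ** 2)
print(f"held-out R^2 of the probe: {1 - ss_res / ss_tot:.3f}")
```

A probe that decodes the running sums accurately is evidence the value really is represented in the states, which is what the team reportedly found for the ICoT model but not for the standard one.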
The ICoT model organizes its attention into distinct pathways across time, functioning like a well-organized filing system. In early layers, the model computes products of digit pairs and stores them at specific locations. In later layers, it retrieves exactly the values needed to calculate each digit of the final answer [2]. This efficient internal structure for carrying out multiplication never emerges in standard models.

Perhaps most remarkably, researchers found the ICoT model internally represents operations using elegant structures. Instead of treating digits as symbols alone, the model encodes them as wave-like patterns known as Fourier bases and organizes arithmetic in a visual, spatial way [1]. When multiplying digit pairs, the model uses a natural geometric operation called a Minkowski sum, something researchers didn't program but which emerged naturally during training. The successful model essentially derived its own efficient mathematical language for arithmetic.
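Since the summary only names these two ingredients, the numpy snippet below shows what they are in isolation: a digit encoded as cosine and sine "waves" at a few frequencies, and the Minkowski sum of two point sets (every pairwise sum of their points). How the trained model combines them is described in the paper itself; this is a glossary-style illustration, with the frequencies and example sets chosen arbitrarily.

```python
import numpy as np

def fourier_digit(d: int, freqs=(1, 2, 3)) -> np.ndarray:
    """Encode a digit 0-9 as points on circles: cos/sin of 2*pi*k*d/10."""
    angles = np.array([2 * np.pi * k * d / 10 for k in freqs])
    return np.concatenate([np.cos(angles), np.sin(angles)])

def minkowski_sum(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Minkowski sum of two point sets: {a + b for every a in A, b in B}."""
    return (A[:, None, :] + B[None, :, :]).reshape(-1, A.shape[1])

print(fourier_digit(7).round(2))            # wave-like features for the digit 7

square = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
segment = np.array([[0, 0], [2, 2]], dtype=float)
print(minkowski_sum(square, segment))       # the square swept along the segment
```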
Reasoning that standard models failed because they lacked proper built-in guidance, researchers tested whether providing the right training signal could fix the problem. They introduced a modified training objective that teaches models to track running sums at each step, allowing them to carry intermediate values and partial products forward [2]. This single addition to a two-layer model that had completely failed under standard training produced remarkable results: 99% accuracy without explicit chain-of-thought supervision [1]. The modified approach enabled the model to develop mechanisms similar to ICoT's, including storing and retrieving partial products and tracking multiple digit pairs simultaneously.

"Our research is trying to chart that terrain," said Chenhao Tan [2]. The study demonstrates that architectural insights and training techniques can overcome obstacles that scaling alone cannot address, emphasizing the importance of built-in guidance in advancing AI capabilities. The long-range dependency problem extends beyond arithmetic to other sequential tasks in language modeling, suggesting these findings could have broader implications for improving large language models across diverse applications.