Large language models achieve under 1% accuracy at basic multiplication, new study reveals

Reviewed by Nidhi Govil


Despite handling complex coding tasks, large language models fail at four-digit multiplication, achieving less than 1% accuracy with standard training. Researchers from University of Chicago, MIT, Harvard, and Google DeepMind discovered the culprit: models can't store and retrieve intermediate computations. But a specialized Implicit Chain of Thought method achieved 100% accuracy by teaching models to internalize reasoning processes.

Large language models struggle with multi-digit multiplication despite advanced capabilities

Large language models have reached impressive heights in AI reasoning capabilities, writing sophisticated code and tackling complex problems. Yet when confronted with four-digit multiplication—a task elementary school students master—even state-of-the-art systems collapse. Research from University of Chicago PhD student Xiaoyan Bai and faculty member Chenhao Tan, working alongside collaborators from MIT, Harvard University, University of Waterloo, and Google DeepMind, has uncovered why this paradox exists and how to solve it [1][2].

Source: Futurity

The phenomenon reflects what researchers call AI's "jagged frontier"—the capacity to excel at sophisticated reasoning while stumbling on seemingly simple tasks. Standard large language models achieved less than 1% accuracy when multiplying two four-digit numbers, even when tested with architectures ranging from two layers all the way up to 12 layers [1]. Scaling up training data or adding more computational layers didn't help. The standard approaches were clearly failing, and the research team wanted to understand exactly why.

Long-range dependencies expose fundamental limitations in standard fine-tuning

Multiplying larger numbers requires carrying over digits and mentally holding onto partial products before adding them to reach a final total. These processes that require storing information for later use are called long-range dependencies [1]. Standard large language models work by recognizing patterns in training data, but as problems grow more complex, the likelihood of a model having seen that specific problem diminishes. The challenge becomes teaching models to learn a process rather than simply memorize answers.

Researchers discovered that under standard fine-tuning approaches, models converge on a "local optimum"—what they identify as the best solution within each dataset. However, tasks like multi-digit multiplication require models to store and retrieve intermediate computations while producing later digits [2]. Without an architecture capable of handling this information flow, models get stuck, unable to move beyond that local optimum regardless of training duration or scale.

Implicit Chain of Thought (ICoT) method achieves 100% accuracy breakthrough

The research team examined a model trained using Implicit Chain of Thought (ICoT), which achieved 100% accuracy compared to the dismal performance of standard methods [1]. By reverse-engineering both successful and failed approaches, they uncovered fundamental insights into how models can truly learn mathematical reasoning.

The ICoT model learned to track long-range dependencies—the information it gradually assembles to solve problems. The team verified this by testing whether they could decode intermediate values, such as running sums, from the models' internal states. In the ICoT model, they could extract these values, but in the standard model, they couldn't [2]. The ICoT method gradually removes intermediate reasoning steps during training, forcing the model to internalize the reasoning process in its hidden states rather than relying on explicit step-by-step tokens.
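The stepwise removal ICoT relies on can be sketched as a data-side curriculum. The sketch below is a simplified assumption about how such a schedule might look—the token format, example problem, and one-step-per-stage schedule are illustrative, not the paper's exact recipe: each stage drops one more explicit reasoning step from the training target, until only question and answer remain.

```python
def icot_curriculum(question, cot_steps, answer, n_stages=None):
    """Yield training targets with progressively fewer explicit reasoning
    tokens; by the final stage the model sees only question -> answer and
    must carry the removed steps in its hidden states."""
    if n_stages is None:
        n_stages = len(cot_steps) + 1
    for stage in range(n_stages):
        kept = cot_steps[stage:]          # drop the first `stage` steps
        yield question + kept + [answer]

stages = list(icot_curriculum(
    question=["34*12="],
    cot_steps=["34*2=68", "34*10=340", "68+340=408"],
    answer="408",
))
for target in stages:
    print(target)
```

The first stage is ordinary chain-of-thought supervision; the last stage is the direct question-to-answer mapping, with the intermediate reasoning now implicit.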

Attention patterns reveal elegant mathematical structures emerge naturally

The ICoT model organizes its attention into distinct pathways across time, functioning like a well-organized filing system. In early layers, the model computes products of digit pairs and stores them at specific locations. In later layers, it retrieves exactly the values needed to calculate each digit of the final answer [2]. This efficient internal structure for carrying out multiplication never emerges in standard models.

Perhaps most remarkably, researchers found the ICoT model internally represents operations using elegant structures. Instead of treating digits as symbols alone, the model encodes them as wave-like patterns known as Fourier bases and organizes arithmetic in a visual, spatial way [1]. When multiplying digit pairs, the model uses a natural geometric operation called a Minkowski sum—something researchers didn't program but which emerged naturally during training. The successful model essentially derived its own efficient mathematical language for arithmetic.

Modified training objective enables breakthrough without explicit chain-of-thought

Reasoning that standard models failed because they lacked proper built-in guidance, researchers tested whether providing the right training signal could fix the problem. They introduced a modified training objective that teaches models to track running sums at each step, allowing them to carry intermediate values and partial products forward [2].

This single addition to a two-layer model that had completely failed under standard training produced remarkable results: 99% accuracy without explicit chain-of-thought reasoning [1]. The modified approach enabled the model to develop mechanisms similar to ICoT's, including storing and retrieving partial products and tracking multiple digit pairs simultaneously.
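The kind of per-step supervision such an objective provides can be sketched as follows. This is an assumption about the shape of the auxiliary targets (one shifted partial product and one running sum per multiplier digit), not the paper's exact loss or data format:

```python
def running_sum_targets(a: int, b: int):
    """For each digit of the multiplier (least-significant first), emit the
    shifted partial product and the running sum accumulated so far -- the
    intermediate values the model is supervised to track."""
    targets = []
    running = 0
    for i, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * 10 ** i   # shifted partial product
        running += partial                   # value carried to later steps
        targets.append((partial, running))
    return targets

print(running_sum_targets(1234, 5678))
# [(9872, 9872), (86380, 96252), (740400, 836652), (6170000, 7006652)]
```

The final running sum is the answer itself, so supervising these intermediate values gives the model an explicit signal for exactly the quantities it must store and retrieve.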

"Our research is trying to chart that terrain," said Chenhao Tan

2

. The study demonstrates that architectural insights and training techniques can overcome obstacles that scaling alone cannot address, emphasizing the importance of built-in guidance in advancing AI capabilities. The long-range dependency problem extends beyond arithmetic to other sequential tasks in language modeling, suggesting these findings could have broader implications for improving large language models across diverse applications.
