Curated by THEOUTPOST
On Thu, 17 Oct, 1:08 PM UTC
2 Sources
[1]
Microsoft's Differential Transformer cancels attention noise in LLMs
Improving the capabilities of large language models (LLMs) in retrieving in-prompt information remains an area of active research that can impact important applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). Microsoft Research and Tsinghua University researchers have introduced the Differential Transformer (Diff Transformer), a new LLM architecture that improves performance by amplifying attention to relevant context while filtering out noise. Their findings, published in a research paper, show that Diff Transformer outperforms the classic Transformer architecture in various settings.

Transformers and the "lost-in-the-middle" phenomenon

The Transformer architecture is the foundation of most modern LLMs. It uses an attention mechanism to weigh the importance of different parts of the input sequence when generating output. The attention mechanism employs the softmax function, which normalizes a vector of values into a probability distribution. In Transformers, the softmax function assigns attention scores to different tokens in the input sequence. However, studies have shown that Transformers struggle to retrieve key information from long contexts.

"We began by investigating the so-called 'lost-in-the-middle' phenomenon," Furu Wei, Partner Research Manager at Microsoft Research, told VentureBeat, referring to previous research showing that LLMs "do not robustly make use of information in long input contexts" and that "performance significantly degrades when models must access relevant information in the middle of long contexts."

Wei and his colleagues also observed that some LLM hallucinations, where the model produces incorrect outputs despite having the relevant context information, correlate with spurious attention patterns. "For example, large language models are easily distracted by context," Wei said. "We analyzed the attention patterns and found that the Transformer attention tends to over-attend irrelevant context because of the softmax bottleneck."

The softmax function used in the Transformer's attention mechanism tends to distribute attention scores across all tokens, even those that are not relevant to the task. This can cause the model to lose focus on the most important parts of the input, especially in long contexts.

"Previous studies indicate that the softmax attention has a bias to learn low-frequency signals because the softmax attention scores are restricted to positive values and have to be summed to 1," Wei said. "The theoretical bottleneck renders [it] such that the classic Transformer cannot learn sparse attention distributions. In other words, the attention scores tend to flatten rather than focusing on relevant context."

Differential Transformer

To address this limitation, the researchers developed the Diff Transformer, a new foundation architecture for LLMs. The core idea is a "differential attention" mechanism that cancels out noise and amplifies the attention given to the most relevant parts of the input.

The Transformer uses three vectors to compute attention: query, key, and value. The classic attention mechanism applies the softmax function over the full query and key projections. The proposed differential attention instead partitions the query and key vectors into two groups and computes two separate softmax attention maps. The difference between these two maps is then used as the attention score. This process cancels out common noise, encouraging the model to focus on the information that is pertinent to the input. The researchers compare their approach to noise-canceling headphones or differential amplifiers in electrical engineering, where the difference between two signals cancels out common-mode noise.
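To make the mechanism concrete, here is a minimal, illustrative PyTorch sketch of a single differential attention head, written from the description above rather than from the released code. The query and key projections are split into two groups, each group produces its own softmax map, and the second map, scaled by a learnable factor (called lambda in the paper, simplified here to one scalar), is subtracted from the first. The class and parameter names are this sketch's own; multi-head handling, causal masking, and the per-head normalization used in the actual architecture are omitted.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DiffAttentionHead(nn.Module):
    """Illustrative single-head differential attention (simplified sketch)."""

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        # Two query/key groups share a single value projection.
        self.q_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        self.out_proj = nn.Linear(d_head, d_model, bias=False)
        # Learnable scale on the subtracted map (a single scalar here;
        # the paper uses a richer per-layer re-parameterization).
        self.lam = nn.Parameter(torch.tensor(lambda_init))
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)

        # Two separate softmax attention maps over the same tokens.
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)

        # Their difference is the attention score: noise common to both
        # maps cancels, so scores can concentrate on relevant tokens.
        attn = a1 - self.lam * a2
        return self.out_proj(attn @ v)
```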
While Diff Transformer involves an additional subtraction operation compared to the classic Transformer, it maintains efficiency thanks to parallelization and optimization techniques. "In the experimental setup, we matched the number of parameters and FLOPs with Transformers," Wei said. "Because the basic operator is still softmax, it can also benefit from the widely used FlashAttention cuda kernels for acceleration."

In retrospect, the method used in Diff Transformer seems like a simple and intuitive solution. Wei compares it to ResNet, a popular deep learning architecture that introduced "residual connections" to improve the training of very deep neural networks. Residual connections made a very simple change to the traditional architecture yet had a profound impact.

"In research, the key is to figure out 'what is the right problem?'" Wei said. "Once we can ask the right question, the solution is often intuitive. Similar to ResNet, the residual connection is an addition, compared with the subtraction in Diff Transformer, so it wasn't immediately apparent for researchers to propose the idea."

Diff Transformer in action

The researchers evaluated Diff Transformer on various language modeling tasks, scaling it up in terms of model size (from 3 billion to 13 billion parameters), training tokens, and context length (up to 64,000 tokens). Their experiments showed that Diff Transformer consistently outperforms the classic Transformer architecture across different benchmarks. A 3-billion-parameter Diff Transformer trained on 1 trillion tokens showed consistent improvements of several percentage points over similarly sized Transformer models.

Further experiments with different model sizes and training dataset sizes confirmed the scalability of Diff Transformer. The findings suggest that, in general, Diff Transformer requires only around 65% of the model size or training tokens needed by a classic Transformer to achieve comparable performance.

The researchers also found that Diff Transformer is particularly effective at using increasing context lengths. It showed significant improvements in key information retrieval, hallucination mitigation, and in-context learning.

While the initial results are promising, there is still room for improvement. The research team is working on scaling Diff Transformer to larger model sizes and training datasets. They also plan to extend it to other modalities, including image, audio, video, and multimodal data.

The researchers have released the code for Diff Transformer, implemented with different attention and optimization mechanisms. They believe the architecture can help improve performance across various LLM applications.

"As the model can attend to relevant context more accurately, it is expected that these language models can better understand the context information with less in-context hallucinations," Wei said. "For example, for the retrieval-augmented generation settings (such as Bing Chat, Perplexity, and customized models for specific domains or industries), the models can generate more accurate responses by conditioning on the retrieved documents."
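On Wei's efficiency point, the reason standard fused attention kernels still apply is that the subtraction distributes over the value matrix: (A1 - lam * A2) V equals A1 V - lam * (A2 V), so each term is an ordinary softmax attention. Below is a hedged sketch using PyTorch's scaled_dot_product_attention, which dispatches to FlashAttention-style kernels when available; the function name, tensor shapes, and fixed lambda are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def diff_attention_two_kernels(q1, k1, q2, k2, v, lam: float = 0.8):
    """Differential attention expressed as two standard attention calls.

    Because (A1 - lam * A2) @ V == A1 @ V - lam * (A2 @ V), each term can be
    computed with a fused softmax-attention kernel and combined afterwards.
    Expected shapes: (batch, num_heads, seq_len, d_head).
    """
    out1 = F.scaled_dot_product_attention(q1, k1, v, is_causal=True)
    out2 = F.scaled_dot_product_attention(q2, k2, v, is_causal=True)
    return out1 - lam * out2

# Example call with random tensors (batch=2, heads=4, seq_len=128, d_head=64).
q1, k1, q2, k2, v = (torch.randn(2, 4, 128, 64) for _ in range(5))
out = diff_attention_two_kernels(q1, k1, q2, k2, v)
print(out.shape)  # torch.Size([2, 4, 128, 64])
```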
[2]
Adding Noise Cancellation to LLMs
With the DIFF Transformer, researchers report a 30% accuracy improvement in key information retrieval over long contexts and a 10-20% accuracy gain in many-shot in-context learning across datasets.

LLMs are great for summarisation but not so clever when the answer is hidden within a pile of documents. This is because traditional Transformer models sometimes struggle to allocate attention effectively. They often focus too much on irrelevant parts of the input, which can degrade the quality of the model's outputs. This problem is known as "attention noise". Scaling up traditional Transformers to improve performance is an option, but it typically requires significantly more computational resources and training data, which can be costly and inefficient.

To solve this problem, Microsoft recently released a research paper in which researchers propose a fresh approach to calculating attention, one that focuses more accurately on relevant information and reduces the influence of irrelevant data.

Novak I Zukowski, a research scientist, explained that the DIFF Transformer (Differential Transformer) introduces a differential attention module that replaces conventional softmax attention. It uses a differential denoising approach to amplify relevant signals and suppress noise. Attention head score analysis reveals that the architecture significantly improves focus on pertinent information while reducing attention to irrelevant context, leading to better efficiency, improved context utilisation, and more order-invariant reasoning performance.

"The feed-forward network module remains similar to standard Transformers, with the primary innovation centred on the attention mechanism. Despite a slight decrease in raw throughput, DIFF Transformer demonstrates superior performance across various tasks, particularly in long-context scenarios, and shows promise for better quantisation due to reduced activation outliers," he added.

The approach is simple: the model learns two different projections for attention, one to actually attend and a second to act as a reference for noise cancellation. When attention is calculated, the difference between the two maps keeps the signal and discards the noise. A Reddit user noted that this works because both the weights and the scaling used for taking the difference are trained in parallel with the rest of the model; the specialisation of the two functional blocks occurs much as it does for neurons within a layer of a regular neural net.

"The two sets of weights learn different things. The second/negative set of weights is constrained by the softmax function to be unable to direct attention towards specific tokens. Doing so would require producing a negative value, and softmax output values are in the [0,1] range. So, the only thing the second set of values can productively learn to do is to suppress noise," he added.
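A tiny numeric illustration of that "keep the signal, lose the noise" idea, using made-up attention scores rather than anything from the paper: both softmax maps spread some weight over irrelevant tokens, but only the first map also locks onto the relevant one, so their difference is concentrated where it matters.

```python
import torch

# Toy scores over five tokens; index 2 is the "relevant" one (illustrative numbers only).
a1 = torch.softmax(torch.tensor([1.0, 1.0, 4.0, 1.0, 1.0]), dim=-1)  # attends to token 2
a2 = torch.softmax(torch.tensor([1.0, 1.0, 1.0, 1.0, 1.0]), dim=-1)  # uniform "noise" reference

lam = 0.8             # learned jointly with the model in DIFF Transformer; fixed here
diff = a1 - lam * a2  # differential attention scores

print(a1)    # ~[0.042, 0.042, 0.834, 0.042, 0.042]
print(diff)  # ~[-0.118, -0.118, 0.674, -0.118, -0.118] -> weight concentrated on token 2
```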
The DIFF Transformer introduces differential attention to improve accuracy and reduce hallucinations by filtering noise. It uses 35-40% fewer parameters to match baseline performance, handles long contexts of up to 64K tokens effectively, and enhances in-context learning. Additionally, it supports low-bit quantisation without significant performance loss, offering a promising upgrade for language models with better long-context comprehension. This could lead to more efficient and accurate LLMs in the future.

Rohan Paul, an AI engineer, mentioned in a recent post on X that, apart from reducing noise, the DIFF Transformer also shows remarkable improvements in performance: a 30% accuracy improvement in key information retrieval with 64K context, a 10-20% accuracy gain in many-shot in-context learning across datasets, and a 7-11% reduction in hallucination for summarisation and question answering. It also maintains performance with 6-bit quantisation, while the standard Transformer degrades significantly.

Apart from the Differential Transformer paper, there are other efforts to reduce noise. For example, Denoising LM, a research paper from Apple, uses LLMs to clean up and correct errors in speech recognition outputs, significantly improving accuracy even in challenging, noisy environments. We can expect more LLM-based technology that depends on outputs being as noise-free as possible, and that is where approaches like the DIFF Transformer come into play.
Microsoft Research and Tsinghua University introduce the Differential Transformer, a new LLM architecture that improves performance by reducing attention noise and enhancing focus on relevant context.
Microsoft Research, in collaboration with Tsinghua University, has introduced a new architecture for large language models (LLMs) called the Differential Transformer (Diff Transformer). The architecture aims to address limitations of traditional Transformer models, particularly in handling long contexts and filtering attention noise [1].
Conventional Transformer models often struggle to allocate attention effectively, and their performance degrades when relevant information sits in the middle of a long input, a phenomenon known as the "lost-in-the-middle" problem. Furu Wei, Partner Research Manager at Microsoft Research, explained that LLMs can be easily distracted by irrelevant context, resulting in over-attention to non-essential information [1].
The Diff Transformer introduces a novel "differential attention" mechanism that effectively cancels out noise while amplifying attention to relevant parts of the input. The approach involves splitting the query and key vectors into two groups, computing a separate softmax attention map for each group, and using the difference between the two maps as the attention score.
This process is analogous to noise-canceling headphones or differential amplifiers in electrical engineering, where common-mode noise is eliminated by comparing two signals [1].
Experiments have shown that the Diff Transformer consistently outperforms classic Transformer architectures across various benchmarks. Key improvements include stronger key information retrieval in long contexts, reduced hallucination in summarization and question answering, and better many-shot in-context learning [1][2].
Notably, the Diff Transformer achieves comparable performance to classic Transformers while requiring only about 65% of the model size or training tokens [1].
The introduction of the Diff Transformer has significant implications for the field of AI:
Improved long-context comprehension: The architecture can effectively handle contexts up to 64,000 tokens, addressing a major limitation of current LLMs [2].
Enhanced efficiency: With 35-40% fewer parameters needed for comparable performance, the Diff Transformer offers a more resource-efficient alternative to traditional models [2].
Better quantization support: The architecture maintains performance with 6-bit quantization, while traditional Transformers see significant degradation [2].
As the AI community continues to explore ways to improve LLM performance, the Diff Transformer represents a promising step towards more accurate and efficient language models. Its ability to reduce noise and enhance focus on relevant information could lead to significant advancements in various AI applications, from summarization to question-answering systems.