Curated by THEOUTPOST
On Fri, 28 Feb, 8:02 AM UTC
3 Sources
[1]
Diffusion LLMs Arrive: Is This the End of Transformer Large Language Models (LLMs)?
The development of large language models (LLMs) is entering a pivotal phase with the emergence of diffusion-based architectures. These models, spearheaded by Inception Labs through its new Mercury system, present a significant challenge to the long-standing dominance of Transformer-based systems. Mercury introduces a novel approach that promises faster token generation speeds while maintaining performance comparable to existing models. This innovation has the potential to reshape how artificial intelligence handles text, image, and video generation, paving the way for more advanced multimodal applications.

"Mercury is up to 10x faster than frontier speed-optimized LLMs. Our models run at over 1000 tokens/sec on NVIDIA H100s, a speed previously possible only using custom chips. The Mercury family of diffusion large language models (dLLMs) [is] a new generation of LLMs that push the frontier of fast, high-quality text generation."

Unlike Transformers, which generate text one token at a time, Mercury takes a bold leap by producing tokens in parallel, drastically cutting response times. The result? Up to 10 times faster generation without compromising quality. But this isn't just about speed -- it's about unlocking new possibilities for AI, from real-time applications to multimodal capabilities like generating text, images, and even videos.

Diffusion-based LLMs represent a fundamental shift in how language is generated. Unlike Transformers, which rely on sequential autoregressive modeling to generate tokens one at a time, diffusion models produce tokens in parallel. This approach is inspired by the diffusion processes used in image and video generation, where noise is incrementally removed to create coherent outputs.
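The parallel, coarse-to-fine idea can be sketched in a few lines. The toy below is purely illustrative and is not Inception's actual algorithm: it starts from a fully masked sequence, predicts every position in parallel at each denoising step, and reveals a growing fraction of positions until none remain masked. The `predict` function stands in for a trained network.

```python
MASK = "<mask>"

def toy_diffusion_generate(length, steps, predict):
    """Illustrative parallel generation: begin fully masked, then reveal
    a growing fraction of positions at each denoising step."""
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        # Every position is predicted in parallel at every step.
        proposal = [predict(i) for i in range(length)]
        # The unmasked fraction grows linearly until it reaches 1.0.
        n_reveal = (length * step) // steps
        tokens = proposal[:n_reveal] + tokens[n_reveal:]
    return tokens

# A stand-in "model" that deterministically fills each position.
out = toy_diffusion_generate(6, steps=3, predict=lambda i: f"tok{i}")
print(out)  # after the final step, no <mask> tokens remain
```

Contrast this with autoregression, where generating 6 tokens would take 6 sequential model calls; here the number of passes is fixed by `steps`, independent of sequence length.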
By adopting this parallel token generation strategy, diffusion-based LLMs aim to overcome the latency challenges of sequential processing. The result is a faster and potentially more scalable way to generate high-quality outputs, making these models particularly appealing for applications requiring real-time performance.

Inception Labs' Mercury model has set a new standard in LLM technology. Capable of generating up to 1,000 tokens per second on standard NVIDIA hardware, Mercury is reportedly up to 10 times faster than even the most speed-optimized Transformer-based models, and this performance leap comes without compromising the quality of the generated outputs. Mercury is currently available in two specialized versions -- Mercury Coder Mini and Mercury Coder Small -- both tailored to developers working on coding-focused projects.

Mercury has undergone rigorous benchmarking against leading Transformer-based models, including Gemini 2.0 Flash-Lite, GPT-4o Mini, and open-weight models like Qwen 2.5 and DeepSeek Coder V2 Lite. While its overall performance aligns closely with smaller Transformer models, Mercury's parallel token generation gives it a distinct advantage in speed. This makes it particularly well suited for applications requiring real-time responses or large-scale data processing, where efficiency is critical. By addressing these needs, Mercury positions itself as a compelling alternative to traditional Transformer-based systems, especially where latency reduction is a priority.
The diffusion-based architecture of Mercury extends its utility far beyond text generation. Its ability to generate images and videos positions it as a versatile tool for industries exploring creative and multimedia applications. This multimodal capability opens up new possibilities for sectors such as entertainment, advertising, and content creation, where the demand for high-quality, AI-generated visuals is growing. Additionally, Mercury's enhanced reasoning capabilities and agentic workflows make it a strong candidate for complex problem-solving tasks, such as advanced coding, data analysis, and decision-making. The parallel token generation mechanism further enhances its efficiency, allowing faster solutions across a wide range of use cases, from customer service chatbots to large-scale content generation systems.

Despite its promise, Mercury is not without challenges. Early versions of the model have shown difficulties handling highly intricate or ambiguous prompts, which highlights areas where further refinement is necessary. Additionally, usage is currently capped at 10 requests per hour, a limitation that could hinder adoption in high-demand environments. Addressing these early limitations will be crucial for Mercury to achieve broader adoption and to compete effectively with established Transformer-based systems.

Inception Labs has ambitious plans to expand Mercury's reach by integrating it into APIs, allowing developers to seamlessly incorporate its capabilities into their workflows. This integration could accelerate innovation in LLM applications, fostering the development of more efficient and versatile AI systems. The success of Mercury also raises important questions about the future of LLM design, with diffusion-based models emerging as a viable alternative to the Transformer paradigm.
As these models continue to mature, they may inspire a wave of new architectures that prioritize speed, scalability, and multimodal capabilities. While Mercury leads the charge in diffusion-based LLMs, it is not the only experimental architecture under development. Liquid AI's Liquid Foundation Models (LFMs) represent another attempt to move beyond Transformers, though early results indicate that LFMs have yet to match Mercury's performance or efficiency. These efforts reflect a growing interest in diversifying LLM architectures, and the exploration of alternatives such as LFMs and diffusion-based systems signals a broader shift in AI research toward overcoming the constraints of traditional Transformer-based designs.

The advent of diffusion-based LLMs marks a significant milestone in the evolution of artificial intelligence. Mercury, with its parallel token generation and multimodal capabilities, challenges the dominance of Transformer-based systems by offering a faster and more versatile alternative. While still in its early stages, this innovation has the potential to reshape the future of AI, driving advancements in text, image, and video generation. As diffusion-based models continue to evolve, they may well define the next chapter in large language model development.
[2]
The 'First Commercial Scale' Diffusion LLM Mercury Offers over 1000 Tokens/sec on NVIDIA H100
Built by Inception Labs, the model doesn't require specialised hardware to achieve its speed.

For a long time, there has been an active discussion about finding a better architecture for large language models (LLMs) than the Transformer. Two months into 2025, this California-based startup seems to have a promising answer. Inception Labs, founded by professors from Stanford, the University of California, Los Angeles (UCLA), and Cornell, has introduced Mercury, which the company claims is the first commercial-scale diffusion large language model.

Mercury is ten times faster than current frontier models. According to an independent benchmarking platform, Artificial Analysis, the model's output speed exceeds 1000 tokens per second on NVIDIA H100 GPUs, a speed previously possible only with custom chips.

"Transformers have dominated LLM text generation and generate tokens sequentially. This is a cool attempt to explore diffusion models as an alternative by generating the entire text at the same time using a coarse-to-fine process," Andrew Ng, founder of DeepLearning.AI, wrote in a post on X.

Ng's last phrase is key to understanding why Inception Labs' approach is interesting. Andrej Karpathy, a former researcher at OpenAI who currently leads Eureka Labs, explained it in a post on X: LLMs based on Transformers are trained autoregressively, meaning they predict words (or tokens) from left to right, whereas diffusion is the technique AI models use to generate images and videos. "Diffusion is different - it doesn't go left to right, but all at once. You start with noise and gradually denoise into a token stream," Karpathy added. He also suggested that Mercury has the potential to be different and showcase new possibilities. And as per the company's testing, it does make a difference in output speed.
In the company's evaluation across standard coding benchmarks, Mercury surpasses the performance of speed-focused small models like GPT-4o Mini, Gemini 2.0 Flash and Claude 3.5 Haiku. The Mercury Coder Mini model achieved 1109 tokens per second.

Source: Artificial Analysis

Moreover, the startup said diffusion models have an advantage in reasoning and in structuring their responses because they are not restricted to considering only their previous outputs, and they can continuously refine their output to reduce hallucinations and errors. Diffusion techniques already power image and video generation tools like Midjourney and Sora.

The company also took a subtle dig at current reasoning models and their bet on inference-time scaling, which uses additional compute while generating the output. "Generating long reasoning traces comes at the price of ballooning inference costs and unusable latency. A paradigm shift is needed to make high-quality AI solutions truly accessible," the company said. Inception Labs has released a preview version of Mercury Coder, which allows users to test the model's capabilities.

Small models optimised for speed are under threat - but what about specialised hardware providers like Groq, Cerebras and SambaNova? It isn't for no reason that NVIDIA achieved the status of the world's most valuable company during the AI frenzy: its GPUs are ubiquitously preferred for training AI models. However, the company's Achilles heel has been providing low-latency, high-speed outputs -- even Jensen Huang, CEO of NVIDIA, has noted this. That opened up an opportunity for companies like Groq, Cerebras, and SambaNova to build hardware dedicated to high-speed outputs. Until now, Mercury's speed was matched only by models hosted on specialised inference platforms -- for instance, Mistral's Le Chat running on Cerebras.
Recently, Jonathan Ross, CEO of Groq, said that people will continue to buy NVIDIA GPUs for training, but that high-speed inference will necessitate specialised hardware. Does Mercury's breakthrough suggest a threat to this ecosystem?

Moreover, Inception Labs says diffusion LLMs can serve as drop-in replacements in current use cases like RAG, tool use, and agentic workflows. But this isn't the first time a diffusion model for language has been explored. In 2022, a group of Stanford researchers published research on the same technique but observed that inference was slow. "Interestingly, the main advantage now [with Mercury] is speed. Impressive to see how far diffusion LMs have come!" said Percy Liang, a Stanford professor, comparing Mercury to the older study.

Similarly, a group of researchers from China recently published a study on a diffusion language model they built called LLaDA. The researchers said the 8-billion-parameter version of the model offered competitive performance, with benchmark evaluations revealing better results in several tests than comparable models in its category.
[3]
New AI text diffusion models break speed barriers by pulling words from noise
On Thursday, Inception Labs released Mercury Coder, a new AI language model that uses diffusion techniques to generate text faster than conventional models. Unlike traditional models that create text word by word -- such as the kind that powers ChatGPT -- diffusion-based models like Mercury produce entire responses simultaneously, refining them from an initially masked state into coherent text.

Traditional large language models build text from left to right, one token at a time. They use a technique called "autoregression." Each word must wait for all previous words before appearing. Inspired by techniques from image-generation models like Stable Diffusion, DALL-E, and Midjourney, text diffusion language models like LLaDA (developed by researchers from Renmin University and Ant Group) and Mercury use a masking-based approach. These models begin with fully obscured content and gradually "denoise" the output, revealing all parts of the response at once.

While image diffusion models add continuous noise to pixel values, text diffusion models can't apply continuous noise to discrete tokens (chunks of text data). Instead, they replace tokens with special mask tokens as the text equivalent of noise. In LLaDA, the masking probability controls the noise level, with high masking representing high noise and low masking representing low noise. The diffusion process moves from high noise to low noise. Though LLaDA describes this using masking terminology and Mercury uses noise terminology, both apply a similar concept to text generation rooted in diffusion.

Much like the creation of an image synthesis model, researchers build text diffusion models by training a neural network on partially obscured data, having the model predict the most likely completion and then comparing the results with the actual answer. If the model gets it correct, connections in the neural net that led to the correct answer get reinforced.
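The masking-as-noise idea is easy to demonstrate. In this minimal sketch (not LLaDA's actual code), the forward "noising" process replaces each token with a mask symbol independently with probability `mask_prob`; a higher probability corresponds to a noisier sequence, and generation runs this process in reverse.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob, seed=0):
    """Forward 'noising' for text diffusion: discrete tokens can't take
    continuous noise, so each token is independently replaced by MASK
    with probability mask_prob instead."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

sentence = "the quick brown fox jumps over the lazy dog".split()
low_noise = mask_tokens(sentence, mask_prob=0.2)   # mostly intact
high_noise = mask_tokens(sentence, mask_prob=0.9)  # mostly masked
print(low_noise)
print(high_noise)
```

Training then amounts to showing the model sequences noised at varying `mask_prob` levels and asking it to recover the masked positions, exactly the "predict the most likely completion, compare with the actual answer" loop described above.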
After enough examples, the model can generate outputs with high enough accuracy or plausibility to be useful. According to Inception Labs, its approach allows the model to refine outputs and address mistakes because it isn't limited to considering only previously generated text. This parallel processing enables Mercury's reported 1,000+ tokens per second generation speed on NVIDIA H100 GPUs.

These diffusion models reportedly deliver performance comparable to similarly sized conventional models while running much faster. LLaDA's researchers report their 8 billion parameter model performs similarly to LLaMA3 8B across various benchmarks, with competitive results on tasks like MMLU, ARC, and GSM8K. However, Mercury claims dramatic speed improvements. Its Mercury Coder Mini scores 88.0 percent on HumanEval and 77.1 percent on MBPP -- comparable to GPT-4o Mini -- while reportedly operating at 1,109 tokens per second compared to GPT-4o Mini's 59 tokens per second. This represents roughly a 19x speed advantage over GPT-4o Mini while maintaining similar performance on coding benchmarks. Mercury's documentation states its models run "at over 1000 tokens/sec on NVIDIA H100s, a speed previously possible only using custom chips" from specialized hardware providers like Groq, Cerebras, and SambaNova. When compared to other speed-optimized models, the claimed advantage remains significant -- Mercury Coder Mini is reportedly about 5.5x faster than Gemini 2.0 Flash-Lite (201 tokens/second) and 18x faster than Claude 3.5 Haiku (61 tokens/second).

Opening a potential new frontier in LLMs

Diffusion models do involve some tradeoffs. They typically need multiple forward passes through the network to generate a complete response, unlike traditional models that need just one pass per token. However, because diffusion models process all tokens in parallel, they achieve higher throughput despite this overhead.
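The pass-count tradeoff and the quoted speed ratios can be checked with back-of-the-envelope arithmetic. The 50-step figure below is an assumed illustration, not a published Mercury parameter; the throughput numbers are the ones reported in the benchmarks above.

```python
# An autoregressive model needs one forward pass per generated token,
# while a diffusion model needs a fixed number of denoising passes,
# each of which refines every token in parallel.
def autoregressive_passes(n_tokens):
    return n_tokens

def diffusion_passes(n_steps):
    return n_steps

# Assumed illustration: 50 denoising steps versus a 500-token response.
assert diffusion_passes(50) < autoregressive_passes(500)

# Reported throughputs (tokens/sec) from the benchmarks cited above.
mercury_tps = 1109  # Mercury Coder Mini
rivals = {
    "GPT-4o Mini": 59,
    "Gemini 2.0 Flash-Lite": 201,
    "Claude 3.5 Haiku": 61,
}
ratios = {name: mercury_tps / tps for name, tps in rivals.items()}
print(ratios)  # roughly 19x, 5.5x, and 18x respectively
```

The computed ratios (about 19x, 5.5x, and 18x) match the figures quoted in the article, so the speed claims are at least internally consistent.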
Inception thinks the speed advantages could impact code completion tools where instant response may affect developer productivity, conversational AI applications, resource-limited environments like mobile applications, and AI agents that need to respond quickly. If diffusion-based language models maintain quality while improving speed, they might change how AI text generation develops.

So far, AI researchers have been open to new approaches. Independent AI researcher Simon Willison told Ars Technica, "I love that people are experimenting with alternative architectures to transformers, it's yet another illustration of how much of the space of LLMs we haven't even started to explore yet." On X, former OpenAI researcher Andrej Karpathy wrote about Inception, "This model has the potential to be different, and possibly showcase new, unique psychology, or new strengths and weaknesses. I encourage people to try it out!"

Questions remain about whether larger diffusion models can match the performance of models like GPT-4o and Claude 3.7 Sonnet, and whether the approach can handle increasingly complex simulated reasoning tasks. For now, these models offer an alternative for smaller AI language models that doesn't seem to sacrifice capability for speed. You can try Mercury Coder yourself on Inception's demo site, and you can download code for LLaDA or try a demo on Hugging Face.
Inception Labs introduces Mercury, a diffusion-based large language model that generates text up to 10 times faster than traditional Transformer models, potentially revolutionizing AI text generation.
Inception Labs, a California-based startup founded by professors from Stanford, UCLA, and Cornell, has unveiled Mercury, touted as the first commercial-scale diffusion large language model (dLLM) [1][2]. This innovative approach to text generation challenges the long-standing dominance of Transformer-based models, promising significant speed improvements without compromising performance.
Unlike traditional Transformer models that generate text sequentially, Mercury employs a diffusion-based architecture inspired by image and video generation techniques [1][3]. This novel approach allows for parallel token generation, resulting in dramatically faster text production.
The emergence of diffusion-based LLMs like Mercury signals a potential paradigm shift in AI text generation. As Inception Labs works to integrate Mercury into APIs and expand its capabilities, the AI community watches closely to see if this new approach will redefine the landscape of language models and their applications [1][2][3].
With its impressive speed and performance, Mercury represents a significant step forward in LLM technology, potentially opening new avenues for AI-driven innovation across various industries.
Reference
[1]
[2] Analytics India Magazine | The 'First Commercial Scale' Diffusion LLM Mercury Offers over 1000 Tokens/sec on NVIDIA H100