3 Sources
[1]
NVIDIA Shatters MoE AI Performance Records With a Massive 10x Leap on GB200 'Blackwell' NVL72 Servers, Fueled by Co-Design Breakthroughs
Scaling performance on 'Mixture of Experts' AI models is one of the biggest industry constraints, but NVIDIA appears to have made a breakthrough, credited to co-design performance scaling laws. The AI world has been racing to scale up foundational LLMs by ramping up parameter counts while ensuring that the models excel in performance and applications, but with this approach there is a limit to the compute resources companies can invest in their AI models. This is where 'Mixture of Experts' frontier AI models come into play: for a given query, they activate only a portion of the parameters per token rather than the full set, depending on the type of service request. While MoEs have become dominant in LLMs, scaling them up introduces a massive computing bottleneck, which NVIDIA has now overcome.

In a press release, NVIDIA disclosed that with the GB200 'Blackwell' NVL72 configuration, it has scaled up performance by a factor of 10 compared with the Hopper HGX H200. The firm tested its computing capabilities on the Kimi K2 Thinking MoE model, an open-source LLM with 32 billion activated parameters per forward pass, which is regarded as a standout option in its segment. Team Green claims that the Blackwell architecture is 'poised' to capitalize on the rise of frontier MoE models. To address the performance bottlenecks involved in scaling MoE AI models, NVIDIA employed a 'co-design' approach: the 72-chip GB200 configuration, coupled with 30TB of fast shared memory, takes expert parallelism to a whole new level, even though token batches are constantly split and scattered across GPUs and communication volume grows at a non-linear rate.

Other full-stack optimizations also play a key role in unlocking high inference performance for MoE models. The NVIDIA Dynamo framework orchestrates disaggregated serving by assigning prefill and decode tasks to different GPUs, allowing decode to run with large expert parallelism while prefill uses parallelism techniques better suited to its workload. The NVFP4 format helps maintain accuracy while further boosting performance and efficiency. This achievement is a significant development for NVIDIA and its partners, especially since the GB200 NVL72 configuration is now at the point in the supply chain where many frontier models run on these AI servers to enhance their capabilities. MoE models are known for their computational efficiency, which is why their deployment across a wide range of environments is becoming increasingly prominent, and NVIDIA appears to be at the center of capitalizing on this trend.
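To make the sparse-activation idea concrete, here is a minimal, illustrative sketch of a top-k mixture-of-experts layer in Python. The expert count, top-k value, and dimensions are arbitrary assumptions chosen for readability, not the configuration of Kimi K2 Thinking or any NVIDIA system.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 64        # illustrative hidden size (assumption)
NUM_EXPERTS = 8    # illustrative expert count (assumption)
TOP_K = 2          # experts consulted per token (assumption)

# Each expert is a small feed-forward weight matrix; the router scores experts per token.
experts = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.02 for _ in range(NUM_EXPERTS)]
router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02

def moe_layer(tokens: np.ndarray) -> np.ndarray:
    """Run each token through only its TOP_K highest-scoring experts."""
    logits = tokens @ router_w                      # (n_tokens, NUM_EXPERTS)
    top = np.argsort(logits, axis=1)[:, -TOP_K:]    # indices of the chosen experts
    sel = np.take_along_axis(logits, top, axis=1)   # their scores
    gates = np.exp(sel - sel.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)       # softmax over selected experts only

    out = np.zeros_like(tokens)
    for i in range(len(tokens)):
        for k in range(TOP_K):
            e = top[i, k]
            out[i] += gates[i, k] * (tokens[i] @ experts[e])  # only TOP_K experts run
    return out

tokens = rng.standard_normal((4, HIDDEN))
print(moe_layer(tokens).shape)  # (4, 64): each token touched 2 of 8 experts, not all 8
```

Each token only ever multiplies against its selected experts, which is why per-token compute stays low even as the total parameter count grows.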
[2]
Nvidia servers speed up AI models from China's Moonshot AI and others tenfold
SAN FRANCISCO, Dec 3 (Reuters) - Nvidia on Wednesday published new data showing that its latest artificial intelligence server can improve the performance of new models - including two popular ones from China - by 10 times. The data comes as the AI world has shifted its focus from training AI models, where Nvidia dominates the market, to putting them to use for millions of users, where Nvidia faces far more competition from rivals such as Advanced Micro Devices and Cerebras.

Nvidia's data focused on what are known as mixture-of-expert AI models. The technique is a way of making AI models more efficient by breaking up questions into pieces that are assigned to "experts" within the model. That approach exploded in popularity this year after China's DeepSeek shocked the world with a high-performing open source model that took less training on Nvidia chips than rivals in early 2025. Since then, the mixture-of-experts approach has been adopted by ChatGPT maker OpenAI, France's Mistral and China's Moonshot AI, which in July released a highly ranked open source model of its own. Meanwhile, Nvidia has focused on making the case that while such models might require less training on its chips, its offerings can still be used to serve those models to users.

Nvidia on Wednesday said that its latest AI server, which packs 72 of its leading chips into a single computer with speedy links between them, improved the performance of Moonshot's Kimi K2 Thinking model by 10 times compared to the previous generation of Nvidia servers, a similar performance gain to what Nvidia has seen with DeepSeek's models. Nvidia said the gains primarily came from the sheer number of chips it can pack into servers and the fast links between them, an area where Nvidia still has advantages over its rivals. Nvidia competitor AMD is working on a similar server packed with multiple powerful chips that it has said will come to market next year. (Reporting by Stephen Nellis in San Francisco, editing by Deepa Babington)
[3]
DeepSeek and Moonshot AI gains explained: How NVIDIA AI servers delivered the efficiency boost
High bandwidth accelerator fabric delivers tenfold performance jump in large AI systems

NVIDIA's newest AI server architecture has delivered one of the most significant performance jumps seen this year, unlocking a tenfold efficiency boost for major Chinese models from DeepSeek and Moonshot AI. The gains are not simply the result of faster chips but of a redesigned server layout built to handle the computational demands of mixture-of-experts models, which are rapidly becoming the preferred architecture for large-scale AI systems.

For both DeepSeek and Moonshot AI, the jump came at a crucial moment. Their models are growing in size, complexity and real-world use cases, and inference cost has become a major limiting factor. NVIDIA's new server design aims directly at that pain point. Instead of scaling performance through raw chip power alone, the company created a tightly integrated system where 72 accelerators communicate through high-bandwidth links that minimise bottlenecks during expert routing. The result is a server that can deliver far more throughput without multiplying energy or hardware requirements.

Why mixture-of-experts models benefit most

Mixture-of-experts models rely on selecting specialised subnetworks for each token or task. This architecture can be highly efficient, but only when hardware is capable of moving data between chips with very low latency. Traditional servers struggle when experts sit across multiple accelerators, leading to delays every time the model switches paths. NVIDIA's system is designed to keep communication fast and predictable so that expert selection does not slow down inference.

DeepSeek's and Moonshot's models depend heavily on rapid expert switching. Once deployed on the new hardware, they saw faster response times, higher token throughput and significantly lower cost per query. These gains make it easier for them to serve millions of users while keeping operational expenses under control.

How the new server design creates the efficiency jump

The efficiency improvement comes from three engineering choices. First, the dense 72-chip layout reduces hop distance between accelerators. Second, the server uses a high-speed fabric that lets chips share data without congestion during peak load. Third, memory bandwidth and caching have been optimised to reduce repeated data fetches.

Together, these changes create a pipeline where MoE models run closer to their theoretical speed. For developers, this means they can deploy larger models or handle higher traffic without adding more servers. For users, it means faster responses and more stable performance even during heavy demand.

A shift in the global AI hardware race

The performance leap has wider implications. China's leading AI firms have been searching for ways to expand their capabilities despite supply constraints. Gains of this magnitude help close the gap with American competitors and highlight how hardware choices can influence national AI progress.

The move also intensifies competition in the server market. AMD and Cerebras are preparing next-generation systems, and cloud providers want hardware that can run massive models with lower energy costs. NVIDIA is positioning its new server as the answer to this demand and as the foundation for a future where MoE architectures dominate the landscape.
Efficiency is becoming as important as raw power. DeepSeek's and Moonshot's results show how much can change when hardware is tuned for the models that define modern AI. NVIDIA argues that these systems will become the new standard. If the early numbers are any indication, the next wave of AI will be shaped as much by server design as by model innovation itself.
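To see why the accelerator fabric matters so much here, a rough back-of-envelope estimate of the activation traffic generated by expert-parallel routing helps. Every figure below (tokens in flight, hidden size, experts consulted per token, precision) is an illustrative assumption, not a published specification.

```python
# Back-of-envelope estimate of expert-parallel dispatch traffic per forward pass.
# Every figure here is an illustrative assumption, not a published number.

NUM_GPUS = 72            # accelerators sharing the expert-parallel group
TOKENS = 8192            # tokens in flight per forward pass (assumption)
HIDDEN = 7168            # hidden dimension of each token's activation (assumption)
TOP_K = 8                # experts consulted per token (assumption)
BYTES = 2                # bytes per activation value, e.g. BF16 (assumption)

# Each token's hidden state is dispatched to TOP_K experts and the results combined back.
per_pass_bytes = TOKENS * TOP_K * HIDDEN * BYTES * 2       # dispatch + combine
# With experts spread evenly across the rack, most destinations are remote GPUs.
cross_gpu_bytes = per_pass_bytes * (NUM_GPUS - 1) / NUM_GPUS

print(f"activation traffic per MoE layer pass: {per_pass_bytes / 1e9:.2f} GB")
print(f"of which must cross the GPU fabric:    {cross_gpu_bytes / 1e9:.2f} GB")
# Under these assumptions each MoE layer moves on the order of a gigabyte of
# activations between GPUs, so link bandwidth and latency, not just per-chip
# FLOPs, set the pace of inference.
```

The exact numbers matter less than the shape of the problem: the traffic repeats at every MoE layer of every forward pass, which is why a congestion-free fabric translates directly into throughput.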
NVIDIA has shattered performance records with its GB200 Blackwell NVL72 servers, delivering a tenfold efficiency boost for mixture-of-expert AI models including those from China's DeepSeek and Moonshot AI. The breakthrough addresses a critical computing bottleneck in scaling MoE models, using 72 chips with high-bandwidth links and 30TB of fast shared memory to enable expert parallelism at unprecedented levels.
NVIDIA has achieved a breakthrough in AI infrastructure with its GB200 'Blackwell' NVL72 configuration, delivering a 10x performance increase for mixture-of-expert AI models compared to its previous Hopper HGX H200 generation [1]. The advancement directly tackles one of the industry's most pressing challenges: scaling Mixture of Experts (MoE) architectures without hitting a computing bottleneck. These models have gained prominence because they activate only a portion of their parameters per token depending on the query type, making them far more computationally efficient than traditional large language models [1].
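That efficiency argument can be made concrete with a simple calculation: per-token compute scales with the parameters that actually run, not the parameters the model holds. The 32-billion activated figure comes from the sources above; the one-trillion total is an assumed, illustrative size for a frontier MoE model, not a confirmed specification.

```python
# Rough per-token compute comparison, dense vs mixture-of-experts, using the
# common approximation FLOPs_per_token ≈ 2 × (parameters applied to the token).
# The 32B activated figure is from the sources; the 1T total is an assumption.

ACTIVATED_PARAMS = 32e9   # parameters that actually run per forward pass
TOTAL_PARAMS = 1e12       # parameters held in memory (illustrative assumption)

dense_flops = 2 * TOTAL_PARAMS      # if every parameter ran for every token
moe_flops = 2 * ACTIVATED_PARAMS    # only the routed experts run

print(f"dense-equivalent compute: {dense_flops / 1e12:.2f} TFLOPs per token")
print(f"MoE activated compute:    {moe_flops / 1e12:.3f} TFLOPs per token")
print(f"ratio:                    {dense_flops / moe_flops:.0f}x less work per token")
# The trade-off: every parameter still has to be resident somewhere and
# reachable quickly, which is where pooled memory and fast interconnects come in.
```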
The timing matters significantly. As the AI world shifts focus from training models to deploying them for millions of users, NVIDIA faces intensifying competition from rivals like AMD and Cerebras in the inference market [2]. This tenfold efficiency boost demonstrates NVIDIA's ability to maintain its edge even as mixture-of-expert AI models, popularized by China's DeepSeek earlier this year, require less training on its chips than traditional approaches.

The performance gains stem from NVIDIA's co-design approach rather than raw chip power alone. The GB200 'Blackwell' NVL72 packs 72 chips into a single server with 30TB of fast shared memory, creating an environment where expert parallelism reaches new levels [1]. This dense layout reduces hop distance between accelerators, while high-bandwidth links enable chips to share data without congestion during peak loads [3].
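A quick, illustrative check shows why pooled memory at this scale matters: even though only a fraction of an MoE's parameters run per token, all of them must stay resident and reachable. The model size and precisions below are assumptions for illustration only.

```python
# Illustrative check: can a trillion-parameter-class MoE's full weight set live
# inside a 72-GPU rack's pooled memory? Model size and precisions are assumptions.

POOLED_MEMORY_TB = 30      # fast shared memory cited for the NVL72 rack
TOTAL_PARAMS = 1e12        # illustrative frontier-MoE parameter count (assumption)

for fmt, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("4-bit, NVFP4-like", 0.5)]:
    weights_tb = TOTAL_PARAMS * bytes_per_param / 1e12
    headroom_tb = POOLED_MEMORY_TB - weights_tb
    print(f"{fmt:>17}: weights ≈ {weights_tb:4.1f} TB, "
          f"≈ {headroom_tb:4.1f} TB left for KV caches and activations")
# Lower-precision weight formats shrink the resident model, leaving more of the
# pooled memory for the KV caches that long-context decoding depends on.
```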
NVIDIA tested its capabilities on Moonshot AI's Kimi K2 Thinking MoE model, an open-source system with 32 billion activated parameters per forward pass [1]. The company also reported similar performance gains with DeepSeek's models [2]. Full-stack optimizations play a crucial role, including the NVIDIA Dynamo framework that orchestrates disaggregated serving by assigning prefill and decode tasks to different GPUs. The NVFP4 format helps maintain accuracy while boosting throughput and efficiency [1].
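The disaggregated-serving idea can be sketched at a high level: prefill (processing the whole prompt at once) and decode (generating one token at a time) stress hardware differently, so requests are routed to separate GPU pools. The sketch below is conceptual only; it is not the Dynamo API, and the pool sizes, request shape, and routing rule are invented for illustration.

```python
from dataclasses import dataclass

# Conceptual sketch of disaggregated serving: prompt prefill and token-by-token
# decode are routed to separate GPU pools because they stress hardware
# differently. This is NOT the NVIDIA Dynamo API; pool sizes, the Request
# shape, and the round-robin rule are invented for illustration.

@dataclass
class Request:
    request_id: int
    phase: str  # "prefill" (process the whole prompt) or "decode" (next token)

PREFILL_POOL = [f"gpu-{i}" for i in range(16)]        # compute-bound prompt processing
DECODE_POOL = [f"gpu-{i}" for i in range(16, 72)]     # bandwidth/latency-bound generation

def assign(req: Request) -> str:
    """Send a request to the pool suited to its phase."""
    pool = PREFILL_POOL if req.phase == "prefill" else DECODE_POOL
    return pool[req.request_id % len(pool)]           # placeholder load balancing

print(assign(Request(7, "prefill")))   # lands in the prefill pool
print(assign(Request(7, "decode")))    # subsequent decode steps land in the decode pool
```

Splitting the phases lets decode run with large expert parallelism while prefill uses whatever parallelism suits long prompts, which is the division of labour the sources describe.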
Mixture of Experts architectures rely on selecting specialized subnetworks for each token or task, but traditional servers struggle when experts sit across multiple accelerators, creating delays every time the model switches paths [3]. NVIDIA AI servers address this by keeping communication fast and predictable, ensuring expert selection doesn't slow down inference. Token batches are constantly split and scattered across GPUs, and communication volume grows at a non-linear rate, a load the NVL72 design is built to absorb [1].
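That "split and scattered" routing can be pictured as an all-to-all exchange: each GPU groups its resident tokens by which rank owns the expert the router chose, then ships them across the fabric. The single-process sketch below illustrates just that grouping step with arbitrary sizes; real systems perform it with collective communication between GPUs at every MoE layer.

```python
import numpy as np
from collections import defaultdict

# Single-process illustration of the grouping step behind expert-parallel
# dispatch. Expert count, GPU count and token counts are arbitrary assumptions.

rng = np.random.default_rng(1)

NUM_GPUS = 72            # expert-parallel ranks, one expert shard per GPU
NUM_EXPERTS = 288        # experts spread evenly across ranks (assumption)
TOKENS_LOCAL = 128       # tokens resident on this GPU before routing (assumption)

# Pretend the router already picked one expert per local token.
chosen_expert = rng.integers(0, NUM_EXPERTS, size=TOKENS_LOCAL)

# Group local tokens by the rank that owns their chosen expert (expert e lives
# on rank e % NUM_GPUS); in a real system this feeds an all-to-all exchange.
send_buckets = defaultdict(list)
for token_idx, expert in enumerate(chosen_expert):
    send_buckets[int(expert) % NUM_GPUS].append(token_idx)

stay_local = len(send_buckets.get(0, []))   # pretend we are rank 0
go_remote = TOKENS_LOCAL - stay_local
print(f"tokens staying on this GPU: {stay_local}")
print(f"tokens crossing the fabric: {go_remote}")
# Almost every token leaves its home GPU, and the exchange repeats at every MoE
# layer, which is the communication load the NVL72's links are built to absorb.
```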
For DeepSeek and Moonshot AI, the gains arrived at a critical moment, as their models grow in size and complexity while inference cost becomes a major limiting factor. Once deployed on the new hardware, both saw faster response times, higher token throughput, and significantly lower cost per query [3]. This makes it easier to serve millions of users while keeping operational expenses under control.
The performance leap intensifies competition in the server market and carries wider implications for the global AI hardware race. China's leading AI firms have been searching for ways to expand capabilities despite supply constraints, and gains of this magnitude help close the gap with American competitors [3]. The mixture-of-experts approach exploded in popularity after DeepSeek shocked the world with a high-performing open-source model in early 2025, with the technique subsequently adopted by ChatGPT maker OpenAI, France's Mistral, and China's Moonshot AI [2].

NVIDIA competitor AMD is working on a similar server packed with multiple powerful chips that it has said will come to market next year [2]. Meanwhile, Cerebras also competes in this space as cloud providers seek hardware that can run massive models with lower energy costs [3]. The GB200 NVL72 configuration is now reaching the phase of the supply chain where many frontier models utilize these systems to enhance their capabilities, positioning NVIDIA to capitalize on MoE deployment across increasingly diverse environments [1]. Efficiency is becoming as important as raw power, and these early numbers suggest the next wave of AI will be shaped as much by server architecture as by model innovation itself [3].