NVIDIA AI Servers Achieve 10x Performance Boost for Mixture of Experts Models from DeepSeek and Moonshot

Reviewed by Nidhi Govil


NVIDIA has shattered performance records with its GB200 Blackwell NVL72 servers, delivering a tenfold efficiency boost for mixture-of-experts AI models, including those from China's DeepSeek and Moonshot AI. The breakthrough addresses a critical computing bottleneck in scaling MoE models, using 72 GPUs with high-bandwidth links and 30TB of fast shared memory to enable expert parallelism at unprecedented levels.

NVIDIA Breaks Through MoE Performance Barriers

NVIDIA has achieved a breakthrough in AI infrastructure with its GB200 'Blackwell' NVL72 configuration, delivering a 10x performance increase for mixture-of-experts AI models compared to its previous-generation Hopper HGX H200 systems [1]. The advancement directly tackles one of the industry's most pressing challenges: scaling Mixture of Experts (MoE) architectures without hitting a computing bottleneck. These models have gained prominence because they activate only a portion of their parameters per token, depending on the query type, making them far more computationally efficient than traditional dense large language models [1].
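
The mechanism is easy to see in miniature. Below is a minimal numpy sketch with made-up dimensions (it is not code from DeepSeek's or Moonshot's models): a router scores every expert for each token, but only the top-k experts actually run, so most expert parameters stay idle on any given token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; not the real dimensions of any production model.
num_experts, top_k = 8, 2
d_model, d_ff = 16, 64
num_tokens = 4

# One tiny feed-forward "expert" per slot, plus a router matrix.
W_in = rng.standard_normal((num_experts, d_model, d_ff)) * 0.1
W_out = rng.standard_normal((num_experts, d_ff, d_model)) * 0.1
W_router = rng.standard_normal((d_model, num_experts)) * 0.1

x = rng.standard_normal((num_tokens, d_model))
logits = x @ W_router                        # router scores: (tokens, experts)

y = np.zeros_like(x)
for t in range(num_tokens):
    sel = np.argsort(logits[t])[-top_k:]     # pick the top_k experts per token
    w = np.exp(logits[t, sel] - logits[t, sel].max())
    w /= w.sum()                             # softmax over the chosen experts
    for weight, e in zip(w, sel):
        h = np.maximum(x[t] @ W_in[e], 0.0)  # run only the selected experts
        y[t] += weight * (h @ W_out[e])

print(f"expert parameters touched per token: {top_k}/{num_experts}")
```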

Source: Digit

The timing matters significantly. As the AI world shifts focus from training models to deploying them for millions of users, NVIDIA faces intensifying competition from rivals like AMD and Cerebras in the inference market [2]. This tenfold efficiency boost demonstrates NVIDIA's ability to maintain its edge even as mixture-of-experts models, popularized by China's DeepSeek earlier this year, require less training compute on its chips than traditional approaches.

How Co-Design and Server Architecture Enable the Leap

The performance gains stem from NVIDIA's co-design approach rather than raw chip power alone. The GB200 'Blackwell' NVL72 packs 72 GPUs into a single rack-scale server with 30TB of fast shared memory, creating an environment where expert parallelism reaches new levels [1]. This dense layout reduces the hop distance between accelerators, while high-bandwidth links let chips share data without congestion during peak loads [3].
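
A rough back-of-envelope sketch shows why interconnect bandwidth dominates MoE serving. Every number below is an illustrative assumption, not an NVIDIA or model-vendor figure: each routed token's activations must travel to the GPUs hosting its chosen experts and back, at every MoE layer.

```python
# Rough MoE dispatch-traffic estimate. Every value is an assumption
# chosen for illustration, not a published NVIDIA or model spec.
batch_tokens = 32_768        # tokens in flight across the server (assumed)
d_model      = 7_168         # hidden width (assumed)
top_k        = 8             # experts consulted per token (assumed)
bytes_per_el = 2             # 16-bit activations (assumed)

# Each token's hidden vector is dispatched to top_k expert GPUs and the
# results are gathered back, so count the vector twice per routed copy.
traffic = batch_tokens * top_k * d_model * bytes_per_el * 2
print(f"all-to-all traffic per MoE layer: {traffic / 1e9:.1f} GB")

# At an assumed per-GPU aggregate link bandwidth, the time spent just
# moving tokens for one layer:
link_bw = 900e9              # bytes/s (assumed)
print(f"transfer time per layer: {traffic / link_bw * 1e3:.2f} ms")
```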

Source: Wccftech

NVIDIA tested its capabilities on Moonshot AI's Kimi K2 Thinking MoE model, an open-source system with 32 billion activated parameters per forward pass [1]. The company also reported similar performance gains with DeepSeek's models [2]. Full-stack optimizations play a crucial role, including the NVIDIA Dynamo framework, which orchestrates disaggregated serving by assigning prefill and decode tasks to different GPUs, and the NVFP4 numerical format, which helps maintain accuracy while boosting throughput and efficiency [1].
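
NVFP4 is a block-scaled 4-bit floating-point format. The snippet below is a simplified numpy simulation of the general idea, rounding each small block of values to the nearest 4-bit E2M1 magnitude under a shared per-block scale; a production implementation stores the scale compactly and runs in hardware, so treat this as a conceptual sketch rather than NVIDIA's actual scheme.

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 float (sign handled separately).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # elements sharing one scale, as in block-scaled FP4 formats

def quantize_fp4(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Simulate block-scaled FP4: per-block scale + nearest E2M1 value."""
    blocks = x.reshape(-1, BLOCK)
    # Scale so the largest magnitude in each block maps to E2M1's max (6.0).
    scale = np.abs(blocks).max(axis=1, keepdims=True) / E2M1[-1]
    scale[scale == 0] = 1.0
    scaled = blocks / scale
    # Round each element to the nearest representable magnitude, keep sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1).argmin(axis=-1)
    q = np.sign(scaled) * E2M1[idx]
    return q, scale  # real NVFP4 stores the scale in FP8; kept float here

def dequantize_fp4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).reshape(-1)

x = np.random.default_rng(1).standard_normal(64).astype(np.float32)
q, s = quantize_fp4(x)
err = np.abs(dequantize_fp4(q, s) - x).mean()
print(f"mean abs quantization error: {err:.4f}")
```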

Why Mixture-of-Expert AI Models Benefit Most

Mixture of Experts architectures rely on selecting specialized subnetworks for each token or task, but traditional servers struggle when experts sit across multiple accelerators, creating delays every time the model switches paths [3]. NVIDIA AI servers address this by keeping communication fast and predictable, ensuring expert selection doesn't slow down inference. Token batches are constantly split and scattered across GPUs, and communication volume grows non-linearly as models scale, a load the NVL72's interconnect is designed to absorb [1].
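
That scatter step can be made concrete with a toy simulation (hypothetical expert placement, no real hardware involved): bucketing each routed token copy by the GPU that owns its expert is the bookkeeping an all-to-all exchange performs in a real expert-parallel system, and uneven buckets are where congestion comes from.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)

# Hypothetical layout: experts assumed contiguous across GPUs.
num_gpus, experts_per_gpu, top_k = 8, 4, 2
num_experts = num_gpus * experts_per_gpu
num_tokens = 1_000

# Hypothetical routing decisions: top_k distinct expert ids per token.
routes = np.array([rng.choice(num_experts, size=top_k, replace=False)
                   for _ in range(num_tokens)])

# Bucket each token copy by the GPU that owns its chosen expert.
send_counts = defaultdict(int)
for experts in routes:
    for e in experts:
        send_counts[e // experts_per_gpu] += 1

# In a real system these buckets drive one all-to-all exchange;
# uneven counts mean some links carry more traffic than others.
for gpu, n in sorted(send_counts.items()):
    print(f"GPU {gpu}: {n} token copies")
```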

For DeepSeek and Moonshot AI, the gains arrived at a critical moment, as their models grow in size and complexity while inference costs become a major limiting factor. Once deployed on the new hardware, both saw faster response times, higher token throughput, and significantly lower inference costs [3]. This makes it easier to serve millions of users while keeping operational expenses under control.

Implications for the Global AI Hardware Race

The performance leap intensifies competition in the server market and carries wider implications for the global AI hardware race. China's leading AI firms have been searching for ways to expand capabilities despite supply constraints, and gains of this magnitude help close the gap with American competitors [3]. The mixture-of-experts approach exploded in popularity after DeepSeek shocked the world with a high-performing open-source model in early 2025, with the technique subsequently adopted by ChatGPT maker OpenAI, France's Mistral, and China's Moonshot AI [2].

NVIDIA competitor AMD is working on a similar server packed with multiple powerful chips, which it has said will come to market next year [2]. Meanwhile, Cerebras also competes in this space as cloud providers seek hardware that can run massive models with lower energy costs [3]. The GB200 NVL72 configuration is now reaching the phase of the supply chain where many frontier models run on these systems, positioning NVIDIA to capitalize on MoE deployment across increasingly diverse environments [1]. Efficiency is becoming as important as raw power, and these early numbers suggest the next wave of AI will be shaped as much by server architecture as by model innovation itself [3].
