NVIDIA Blackwell Ultra Dominates MLPerf Training v5.1, Sets 10-Minute Record for Llama 3.1 405B


NVIDIA's Blackwell Ultra architecture sweeps all seven MLPerf Training v5.1 benchmarks, delivering unprecedented AI training performance with its GB300 NVL72 system. The platform achieves a record 10-minute training time for Llama 3.1 405B using breakthrough NVFP4 precision technology.


NVIDIA Achieves Complete MLPerf Training Dominance

NVIDIA has achieved an unprecedented sweep of all seven benchmarks in MLPerf Training v5.1, the industry's most rigorous AI training performance evaluation. The company's Blackwell Ultra-powered GB300 NVL72 rack-scale system delivered record-breaking results across large language models, image generation, recommender systems, computer vision, and graph neural networks [1]. Notably, NVIDIA was the only platform to submit results across every test category, demonstrating the versatility and maturity of its CUDA software ecosystem [2].

Revolutionary NVFP4 Precision Technology

The standout achievement of this benchmark round was NVIDIA's successful use of NVFP4 precision for large language model training, a first in MLPerf Training history. Performing calculations at this lower precision is significantly faster while still meeting the benchmark's strict accuracy requirements [1]. The Blackwell Ultra architecture can execute FP4 calculations at triple the rate of FP8, delivering substantially greater AI compute throughput. NVIDIA's engineering teams innovated across every layer of the technology stack to adopt this precision level, making NVIDIA the only platform to submit MLPerf Training results using FP4 calculations [2].
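NVFP4 is a block-scaled 4-bit floating-point format: small blocks of values share a scale factor so that 4-bit E2M1 numbers can cover a useful dynamic range. The NumPy sketch below illustrates only the general idea of block-scaled FP4 quantization; it is not NVIDIA's implementation, and it omits details NVIDIA has described for NVFP4, such as FP8-encoded block scales and an additional per-tensor scale.

```python
import numpy as np

# Representable magnitudes of an E2M1 (4-bit float) value: {0, 0.5, 1, 1.5, 2, 3, 4, 6}
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blockwise(x, block=16):
    """Quantize a flat tensor to FP4 with one shared scale per `block` values."""
    x = x.reshape(-1, block)
    # Scale each block so its largest magnitude maps to the FP4 maximum (6.0).
    scales = np.abs(x).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scales[scales == 0] = 1.0              # keep all-zero blocks well-defined
    scaled = x / scales
    # Snap each value to the nearest representable FP4 magnitude, preserving sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(scaled) * E2M1_GRID[idx], scales

def dequantize(q, scales):
    return (q * scales).ravel()

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
q, s = quantize_fp4_blockwise(w)
print("mean abs error:", np.abs(w - dequantize(q, s)).mean())
```

The 16-element block size follows NVIDIA's public description of NVFP4. The printed error shows why per-block scaling matters: a single scale for the whole tensor would spend most of the tiny 4-bit grid on outliers.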

Record-Breaking Performance Metrics

The GB300 NVL72 system demonstrated extraordinary gains over previous generations. Compared with the prior Hopper architecture, Blackwell Ultra delivered more than 4x the performance on Llama 3.1 405B pretraining and nearly 5x on Llama 2 70B LoRA fine-tuning using the same number of GPUs [1]. The headline result was a new Llama 3.1 405B training record of just 10 minutes, achieved with more than 5,000 Blackwell GPUs working in coordination, 2.7x faster than the best Blackwell-based result from the previous round [2].
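A quick back-of-the-envelope check puts those ratios in context. All inputs below are the figures reported above; composing them, and assuming the Hopper ratio holds at this cluster scale, is our own arithmetic:

```python
record_min = 10.0  # reported Llama 3.1 405B time on >5,000 Blackwell GPUs
print(f"implied prior-round best: ~{record_min * 2.7:.0f} minutes")  # ~27 minutes

# The >4x-over-Hopper figure was measured at equal GPU counts; if it held
# at this scale (an assumption), Hopper would need >40 minutes for the run.
print(f"implied Hopper time at the same scale: >{record_min * 4:.0f} minutes")
```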

Advanced Architecture and Infrastructure

The Blackwell Ultra architecture incorporates significant advances, including new Tensor Cores delivering 15 petaflops of NVFP4 AI compute, twice the attention-layer compute capacity, and 279GB of HBM3e memory per GPU [1]. The complete GB300 NVL72 system provides 40TB of total memory across its GPUs and CPUs. Supporting this performance is the NVIDIA Quantum-X800 InfiniBand platform, the industry's first end-to-end 800 Gb/s networking solution, which doubles scale-out bandwidth compared with previous generations [2].
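A short sanity check shows how the 40TB figure splits between GPU and CPU memory. The 72-GPU count comes from the NVL72 rack configuration rather than from this article, so treat the split as an estimate:

```python
gpus = 72                    # GB300 NVL72 pairs 72 Blackwell Ultra GPUs with Grace CPUs
hbm_per_gpu_gb = 279         # reported HBM3e per GPU
hbm_total_tb = gpus * hbm_per_gpu_gb / 1000
total_tb = 40                # reported combined GPU + CPU memory
print(f"HBM3e across the rack: ~{hbm_total_tb:.1f} TB")                # ~20.1 TB
print(f"implied Grace CPU memory: ~{total_tb - hbm_total_tb:.1f} TB")  # ~19.9 TB
```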

New Benchmark Categories and Ecosystem Participation

This MLPerf round introduced two new benchmarks: Llama 3.1 8B pretraining and the FLUX.1 image generation model. NVIDIA set performance records on both, training Llama 3.1 8B in 5.2 minutes with 512 Blackwell Ultra GPUs and FLUX.1 in 12.5 minutes with 1,152 Blackwell GPUs [1]. The NVIDIA ecosystem also showed strong participation, with submissions from 15 organizations, including Dell Technologies, Hewlett Packard Enterprise, Lenovo, and Supermicro, highlighting broad industry adoption of NVIDIA's AI training platform [2].
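One crude way to compare the two new results is total GPU-minutes; keep in mind the workloads differ substantially and the runs used different GPU generations, so this is not an efficiency comparison:

```python
# Reported figures: 5.2 min on 512 Blackwell Ultra GPUs; 12.5 min on 1,152 Blackwell GPUs.
llama_8b = 5.2 * 512    # ~2,662 GPU-minutes
flux_1 = 12.5 * 1152    # ~14,400 GPU-minutes
print(f"Llama 3.1 8B: ~{llama_8b:,.0f} GPU-minutes")
print(f"FLUX.1:       ~{flux_1:,.0f} GPU-minutes")
```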
