2 Sources
[1]
NVIDIA Wins Every MLPerf Training v5.1 Benchmark
NVIDIA Blackwell Ultra with NVFP4 delivers a giant leap for large language model training.

In the age of AI reasoning, training smarter, more capable models is critical to scaling intelligence. Delivering the massive performance to meet this new age requires breakthroughs across GPUs, CPUs, NICs, scale-up and scale-out networking, system architectures, and mountains of software and algorithms.

In MLPerf Training v5.1 -- the latest round in a long-running series of industry-standard tests of AI training performance -- NVIDIA swept all seven tests, delivering the fastest time to train across large language models (LLMs), image generation, recommender systems, computer vision and graph neural networks. NVIDIA was also the only platform to submit results on every test, underscoring the rich programmability of NVIDIA GPUs and the maturity and versatility of its CUDA software stack.

NVIDIA Blackwell Ultra Doubles Down

The GB300 NVL72 rack-scale system, powered by the NVIDIA Blackwell Ultra GPU architecture, made its debut in MLPerf Training this round, following a record-setting showing in the most recent MLPerf Inference round. Compared with the prior-generation Hopper architecture, the Blackwell Ultra-based GB300 NVL72 delivered more than 4x the Llama 3.1 405B pretraining performance and nearly 5x the Llama 2 70B LoRA fine-tuning performance using the same number of GPUs.

These gains were fueled by Blackwell Ultra's architectural improvements -- including new Tensor Cores that offer 15 petaflops of NVFP4 AI compute, twice the attention-layer compute and 279GB of HBM3e memory -- as well as new training methods that tapped into the architecture's enormous NVFP4 compute performance.

Connecting multiple GB300 NVL72 systems, the NVIDIA Quantum-X800 InfiniBand platform -- the industry's first end-to-end 800 Gb/s scale-out networking platform -- also made its MLPerf debut, doubling scale-out networking bandwidth compared with the prior generation.
Performance Unlocked: NVFP4 Accelerates LLM Training

Key to the outstanding results this round was performing calculations using NVFP4 precision -- a first in the history of MLPerf Training. One way to increase compute performance is to build an architecture capable of performing computations on data represented with fewer bits, and then to perform those calculations at a faster rate. However, lower precision means less information is available in each calculation, so using low-precision calculations in the training process calls for careful design decisions to keep results accurate.

NVIDIA teams innovated at every layer of the stack to adopt FP4 precision for LLM training. The NVIDIA Blackwell GPU can perform FP4 calculations -- including the NVIDIA-designed NVFP4 format as well as other FP4 variants -- at double the rate of FP8. Blackwell Ultra boosts that to 3x, enabling the GPUs to deliver substantially greater AI compute performance. NVIDIA is the only platform to date that has submitted MLPerf Training results with calculations performed using FP4 precision while meeting the benchmark's strict accuracy requirements.

NVIDIA Blackwell Scales to New Heights

NVIDIA set a new Llama 3.1 405B time-to-train record of just 10 minutes, powered by more than 5,000 Blackwell GPUs working together efficiently. This entry was 2.7x faster than the best Blackwell-based result submitted in the prior round, a gain that came from efficient scaling to more than twice the number of GPUs as well as from the use of NVFP4 precision to dramatically increase the effective performance of each Blackwell GPU. To illustrate the per-GPU performance increase, NVIDIA also submitted results this round using 2,560 Blackwell GPUs, achieving a time to train of 18.79 minutes -- 45% faster than the prior-round submission using 2,496 GPUs.

New Benchmarks, New Records

NVIDIA also set performance records on the two new benchmarks added this round: Llama 3.1 8B and FLUX.1.
Llama 3.1 8B -- a compact yet highly capable LLM -- replaced the long-running BERT-large model, adding a modern, smaller LLM to the benchmark suite. NVIDIA submitted results with up to 512 Blackwell Ultra GPUs, setting the bar at 5.2 minutes to train.

In addition, FLUX.1 -- a state-of-the-art image generation model -- replaced Stable Diffusion v2, with only the NVIDIA platform submitting results on the benchmark. NVIDIA submitted results using 1,152 Blackwell GPUs, setting a record time to train of 12.5 minutes.

NVIDIA continued to hold records on the existing graph neural network, object detection and recommender system tests.

A Broad and Deep Partner Ecosystem

The NVIDIA ecosystem participated extensively this round, with compelling submissions from 15 organizations including ASUSTeK, Dell Technologies, Giga Computing, Hewlett Packard Enterprise, Krai, Lambda, Lenovo, Nebius, Quanta Cloud Technology, Supermicro, University of Florida, Verda (formerly DataCrunch) and Wiwynn.

NVIDIA is innovating at a one-year rhythm, driving significant and rapid performance increases across pretraining, post-training and inference -- paving the way to new levels of intelligence and accelerating AI adoption. See more NVIDIA performance data on the Data Center Deep Learning Product Performance Hub and Performance Explorer pages.
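The scaling claims above can be sanity-checked with simple arithmetic: comparing total GPU-minutes between the 2,560-GPU submission and the roughly 5,000-GPU record run shows how close the larger run comes to linear scaling. A minimal back-of-envelope sketch in Python, using the figures quoted in the articles (5,120 is the GPU count reported for the record run; "scaling efficiency" is a metric derived here for illustration, not one NVIDIA publishes):

```python
# Back-of-envelope scaling check using figures quoted in the articles.
# Total GPU-minutes = gpus * minutes; perfectly linear scaling would keep
# this product constant as the GPU count grows.

small = {"gpus": 2560, "minutes": 18.79}  # smaller submission this round
large = {"gpus": 5120, "minutes": 10.0}   # record submission this round

gpu_min_small = small["gpus"] * small["minutes"]  # ~48,100 GPU-minutes
gpu_min_large = large["gpus"] * large["minutes"]  # 51,200 GPU-minutes

speedup = small["minutes"] / large["minutes"]     # wall-clock gain
efficiency = gpu_min_small / gpu_min_large        # fraction of linear scaling

print(f"speedup {speedup:.2f}x at {efficiency:.0%} scaling efficiency")
```

Doubling the GPU count yields close to a 2x wall-clock speedup here, which is consistent with the article's emphasis on efficient scaling across more than 5,000 GPUs.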
[2]
NVIDIA Blackwell Ultra Secures Win Across All Seven MLPerf AI Training Benchmarks, GB300 NVL72 Sets Record 10 Minutes Training Time For Llama 405B
By securing wins across all MLPerf Training tests, NVIDIA boasts that its Blackwell Ultra-based GB300 NVL72 platform delivers leading AI training performance.

When it comes to delivering leading AI performance, NVIDIA GPUs have always been at the forefront. The Blackwell-based data center GPUs have already showcased their incredible potential several times, and the latest GB300 NVL72 platform is no exception. Today, NVIDIA announced that its Blackwell Ultra-powered AI GPUs have secured first place in every MLPerf Training benchmark, proving that its GB300 NVL72 rack-scale system remains the best possible choice for intensive AI workloads.

In its blog post, NVIDIA claims that it's the only player to have submitted results on every MLPerf test and that it has expanded the performance gap between itself and its rivals. The graph it shared shows that NVIDIA has scored "hundreds" of MLPerf Training and Inference wins in 2025 alone.

The benchmark results show that NVIDIA achieved significantly superior results with the same number of Blackwell Ultra GPUs in the rack system as Hopper-based GPUs. In Llama 3.1 405B pretraining, the GB300 GPUs deliver over 4x the performance of the H100 and nearly 2x that of the Blackwell GB200. Similarly, in Llama 2 70B fine-tuning, eight GB300 GPUs delivered 5x the performance of the H100.

NVIDIA also boasted about its CUDA ecosystem, which gives it big leverage over its competitors. The CUDA software stack excels here, but the rack system itself, plus Quantum-X800 InfiniBand networking at 800 Gb/s, is also unmatched. The GB300 NVL72 brings 279 GB of HBM3e memory per GPU and an incredible 40 TB of total capacity with GPU and CPU memory combined. Such a monster memory configuration speeds up AI workloads, but using FP4 precision for training is also key to the excellent performance.
NVIDIA says it ensured the adoption of FP4 precision for LLM training at every layer of the stack, doubling the speed of calculations compared to FP8. Blackwell Ultra further boosts that to 3x, which is why NVIDIA was able to outpace its competitors and deliver drastically superior performance without increasing the GPU count. Compared to its June submission, the new record was achieved using 5,120 Blackwell GPUs, which took only 10 minutes to train the Llama 3.1 405B-parameter model.
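To make the low-precision idea concrete, here is a minimal Python sketch of block-scaled 4-bit quantization in the spirit of NVFP4. The E2M1 value grid and the shared per-block scale reflect publicly described properties of 4-bit floating-point formats, but the block size, scale encoding, and rounding here are illustrative assumptions; real NVFP4 arithmetic runs in hardware on Blackwell Tensor Cores, not in Python.

```python
# Illustrative block-scaled 4-bit quantization (NOT NVIDIA's implementation).
# A 4-bit E2M1 value can represent only these magnitudes, plus a sign bit:
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Map a block of floats onto signed E2M1 values sharing one scale factor.

    The shared scale stretches the narrow E2M1 range [-6, 6] to cover the
    block's actual magnitudes; the precision lost per value is the trade-off
    that the training recipe must compensate for.
    """
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0  # largest magnitude lands exactly on the grid's 6.0
    quantized = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        quantized.append(mag * scale * (1.0 if x >= 0 else -1.0))
    return scale, quantized

# Example: eight weights quantized with one shared scale factor.
weights = [0.02, -0.41, 1.7, 3.14, -6.0, 0.0, 2.2, -0.07]
scale, q = quantize_block(weights)
max_err = max(abs(a - b) for a, b in zip(weights, q))
```

Doubling or tripling throughput by shrinking each operand to 4 bits only pays off if this kind of quantization error stays small enough to keep training on track, which is why the accuracy requirements of the MLPerf benchmark matter so much here.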
NVIDIA's Blackwell Ultra architecture sweeps all seven MLPerf Training v5.1 benchmarks, delivering unprecedented AI training performance with its GB300 NVL72 system. The platform achieves a record 10-minute training time for Llama 3.1 405B using breakthrough NVFP4 precision technology.

NVIDIA has achieved an unprecedented sweep of all seven benchmarks in MLPerf Training v5.1, the industry's most rigorous AI training performance evaluation. The company's Blackwell Ultra-powered GB300 NVL72 rack-scale system delivered record-breaking results across large language models, image generation, recommender systems, computer vision, and graph neural networks [1]. Notably, NVIDIA was the only platform to submit results across every test category, demonstrating the versatility and maturity of its CUDA software ecosystem [2].

The standout achievement of this benchmark round was NVIDIA's successful implementation of NVFP4 precision for large language model training, a first in MLPerf Training history. This breakthrough enables calculations to be performed at significantly higher speeds while maintaining strict accuracy requirements [1]. The Blackwell Ultra architecture can perform FP4 calculations at triple the rate of FP8, delivering substantially greater AI compute performance. NVIDIA's engineering teams innovated across every layer of the technology stack to adopt this precision level, making it the only platform to successfully submit MLPerf Training results using FP4 calculations [2].

The GB300 NVL72 system demonstrated extraordinary performance improvements over previous generations. Compared to the prior Hopper architecture, Blackwell Ultra delivered more than 4x the performance for Llama 3.1 405B pretraining and nearly 5x the performance for Llama 2 70B LoRA fine-tuning using the same number of GPUs [1]. The most impressive achievement was setting a new Llama 3.1 405B training record of just 10 minutes using more than 5,000 Blackwell GPUs working in coordination, 2.7x faster than the best Blackwell-based result from the previous round [2].

The Blackwell Ultra architecture incorporates significant technological advances, including new Tensor Cores offering 15 petaflops of NVFP4 AI compute, twice the attention-layer compute capacity, and 279GB of HBM3e memory per GPU [1]. The complete GB300 NVL72 system provides an impressive 40TB total memory capacity combining GPU and CPU memory. Supporting this performance is the NVIDIA Quantum-X800 InfiniBand platform, the industry's first end-to-end 800 Gb/s networking solution, which doubles scale-out networking bandwidth compared to previous generations [2].

This MLPerf round introduced two new benchmarks: the Llama 3.1 8B language model and the FLUX.1 image generation model. NVIDIA set performance records on both, achieving 5.2 minutes for Llama 3.1 8B training with 512 Blackwell Ultra GPUs and 12.5 minutes for FLUX.1 with 1,152 Blackwell GPUs [1]. The NVIDIA ecosystem demonstrated strong participation with submissions from 15 organizations including major technology partners such as Dell Technologies, Hewlett Packard Enterprise, Lenovo, and Supermicro, highlighting the broad adoption of NVIDIA's AI training platform across the industry [2].