2 Sources
[1]
Fastest, Largest, Strongest: NVIDIA Blackwell Sweeps MLPerf Training 6.0
NVIDIA delivers the performance, scale and reliability that frontier training requires -- in benchmarks and beyond. Every breakthrough AI model starts the same way: with a training run. The infrastructure running those training jobs shapes everything: how fast teams can iterate, what scale of model they can build and whether those jobs complete reliably. As models grow in size, complexity and intelligence, the demands on training infrastructure are also rising. In MLPerf Training 6.0 -- the latest of a series of rigorous, peer-reviewed industry benchmarks for evaluating AI training performance -- the NVIDIA Blackwell platform led across every category, demonstrating: * Fastest time to train on every benchmark * Largest-scale training across 8,192 GPUs using NVIDIA Blackwell NVL72 systems * The only platform with submissions across all seven benchmarks in the suite NVIDIA brings together performance, scale and reliability in a single platform engineered through extreme codesign to enable AI model builders to launch frontier models faster, minimize training costs and start generating revenue early. Performance: Fastest Time to Train on Every Benchmark MLPerf Training 6.0 added two new mixture-of-experts (MoE) pretraining workloads to the suite: DeepSeek-V3 671B and GPT-OSS-20B, reflecting the growing centrality of MoE architectures. The NVIDIA platform was the only one to be submitted across every benchmark, and delivered the fastest time to train on all seven. This round, NVIDIA submitted results on both NVIDIA GB200 NVL72 and GB300 NVL72 rack-scale systems. Within each rack-scale system, fifth-generation NVIDIA NVLink Switches connect all 72 GPUs with high bandwidth, into a unified pool of compute and memory, enabling them to act as one giant GPU. 
Large-scale MoE training faces the same all-to-all communication challenge as MoE inference -- tokens must be routed across GPUs to reach the right expert subnetwork -- and NVLink's bandwidth advantage is what makes that fast and efficient at scale. NVIDIA also showcased NVFP4 training methods that increase performance while meeting strict accuracy requirements across large- and small-scale pretraining as well as fine-tuning workloads. NVIDIA continues to push low-precision training innovation across different model architectures, most recently using NVFP4 to pretrain the massive 550-billion-parameter NVIDIA Nemotron 3 Ultra model. NVIDIA GB300 NVL72 Delivered up to 1.6x Performance Over GB200 NVL72: In this round, GB300 NVL72 delivered up to 1.6x faster training than GB200 NVL72 at the same scale. Key Blackwell Ultra capabilities such as higher compute density with NVFP4, expanded memory capacity and a higher power ceiling that lets the GPU sustain peak performance drive this improvement. Scale: Largest Blackwell Cluster in MLPerf Training To support distributed training at scale, NVIDIA offers two complementary scale-out networking platforms -- NVIDIA Quantum InfiniBand and NVIDIA Spectrum-X Ethernet -- giving data centers the flexibility to build large-scale clusters optimized for their infrastructure. On DeepSeek-V3 671B, the largest MoE model in the suite, NVIDIA scaled its submission to 8,192 GPUs using GB200 NVL72 systems, the largest-scale Blackwell-based submission in MLPerf Training to date. NVIDIA also submitted results at 5,120 GPUs with NVIDIA GB200 NVL72 systems on Llama 3.1 405B, one of the largest dense LLMs in the suite. This round's results also reflect the deep co-engineering between NVIDIA and its partners on system architecture, networking and software: * Microsoft Azure scaled Llama 3.1 405B training to 8,192 GPUs using GB200 NVL72 systems, and reached the reference quality target in 7.07 minutes, the fastest time to train for this benchmark. 
* CoreWeave delivered the fastest time to train for DeepSeek-V3 671B, reaching the quality target in 2.02 minutes at 8,192-GPU scale using GB300 NVL72 systems connected with Spectrum-X Ethernet networking. At-Scale Reliability: Built for Production In production training environments, runs can span weeks or months across hundreds of thousands of GPUs. At that scale, effective training throughput depends on both the performance of the system and the resiliency that makes it reproducible over time. The MLPerf Training v6.0 results above speak to the performance of NVIDIA's platform. For resiliency, NVIDIA's platform is engineered across two dimensions: * Fewer interruptions: NVIDIA GPUs are built to avoid failures before they occur. Before a GPU reaches a data center, NVIDIA screens it across 30+ manufacturing test stages to catch potential faults early. Once deployed, the Reliability, Availability and Serviceability Engine monitors nearly the entire chip, and self-healing capabilities automatically route around detected faults without interrupting the workload. At the network level, Spectrum-X Ethernet reroutes around failed links in milliseconds, keeping the fabric healthy without disrupting the job. * Faster recovery when interruptions happen: NVIDIA Resiliency Extension, or NVRx, minimizes the time lost when faults do occur, with capabilities spanning fault detection, recovery and health monitoring across the cluster. It automatically detects and manages underperforming nodes before they slow the rest of the cluster down. When a node experiences an interruption, rather than restarting the entire job, the system resumes from a recent checkpoint, aka a saved snapshot of the training state. 
Frontier AI Built on NVIDIA NVIDIA ecosystem partners also participated extensively this round, with compelling submissions from 19 organizations, including ASUSTeK, Microsoft Azure, Cisco, CoreWeave, Dell Technologies, Fujitsu, Giga Computing, Google Cloud, Hewlett Packard Enterprise, Inventec, Krai, Lambda, Nebius, Netweb Technologies India Ltd., Quanta Cloud Computing (QCT), Scitix, Supermicro and TTA. Many of these partners are running some of the most demanding AI training workloads on NVIDIA infrastructure. CoreWeave, which houses its NVIDIA infrastructure within Dell PowerRack systems with Dell PowerEdge servers, is home to several of these workloads. Cohere achieved 3x faster training on GB200 NVL72 for its North agentic AI platform. Midjourney, which trained its v8 image generation model on a Blackwell cluster, is now scaling a large fleet of Blackwell Ultra GPUs on CoreWeave to train upcoming image and video models. On Google Cloud, Thinking Machines Lab saw 2x faster training and serving speeds on GB300 NVL72 compared with prior-generation GPUs, accelerating frontier model research and reinforcement learning workflows. Nebius, running NVIDIA Blackwell and Blackwell Ultra infrastructure on its AI cloud, enabled Higgsfield to reduce model training time by 30%, supporting a platform that now serves 22 million users and generates over 6 million pieces of AI content per day. For a deeper technical look at the MLPerf Training 6.0 results and the optimizations behind them, read this technical blog.
[2]
NVIDIA Blackwell Sweeps Every MLPerf 6.0 Benchmark With No Competition In Sight, While GB300 Systems Run Up to 60% Faster Than GB200
The latest MLPerf Training 6.0 benchmarks are in & NVIDIA has once again secured performance records with its Blackwell GPUs. Blackwell GPUs Make Competition Go Into Hiding at MLPerf 6.0 As NVIDIA Tops Benchmark Charts The latest MLPerf Training v6.0 benchmark results were shared by MLCommons. The latest round adds two new MoE tests for large-scale and entry-level AI deployments: DeepSeek V3 (671b), and GPT-OSS 20B (21b). Being an open-source and peer-reviewed benchmark suite, MLPerf allows all vendors to list the results of their latest and greatest hardware. NVIDIA has been dominating the suite for a while, and it continues to be the trend. While NVIDIA is getting ready to launch its AI-Supercharged Vera Rubin platform in the coming months, the current-generation Blackwell architectures, especially GB300 NVL72 systems, are showcasing immense potential with no competition in sight. In the latest results, NVIDIA shows: * Fastest time to train on every benchmark * Largest-scale training across 8,192 GPUs using NVIDIA Blackwell NVL72 systems * The only platform with submissions across all seven benchmarks in the suite Coming to the benchmark results, NVIDIA was the fastest at each one of them and was also the only one to submit results across all benchmarks in MLPerf 6.0. For reference, NVIDIA's Blackwell platforms were able to achieve stellar speeds. What NVIDIA did in 4.46 mins, the nearest alternative managed to do the same in 58.63 mins, showcasing a 13.1x time split. And for the newest benchmarks, the competition didn't even submit their benchmark results. Meanwhile, NVIDIA continues to uplift the performance of its existing architectures through further optimizations. Blackwell GB200 is already much faster than it was at launch, but the GB300 systems are up to 60% faster in the same NVL72 configuration thanks to their higher AI compute density with NVFP4. 
The Blackwell architecture also scaled to deliver the latest cluster in MLPerf Training, comprising 8192 GPUs running within Microsoft Azure on Llama 3.1 405B. The system reached the quality target in 7.07 minutes, the fastest time-to-train within this benchmark. * Microsoft Azure scaled Llama 3.1 405B training to 8,192 GPUs using GB200 NVL72 systems, and reached the reference quality target in 7.07 minutes, the fastest time to train for this benchmark. * CoreWeave delivered the fastest time to train for DeepSeek-V3 671B, reaching the quality target in 2.02 minutes at 8,192-GPU scale using GB300 NVL72 systems connected with Spectrum-X Ethernet networking. And lastly, we wanted to share the full results comparing NVIDIA Blackwell GPUs against AMD's latest MI300 series offerings up to the MI355X. In DeepSeek v3 671b, NVIDIA is the single dominating force, with the competition not even submitting a single benchmark result. In Flux1, 32 NVIDIA GB300 GPUs end up faster than 512 MI300X and 64 MI320X accelerators. No submission for the newer MI350 series was made. In Llama 2 70b, NVIDIA's GB300 and GB200 8-accelerator systems outpace the competition. Lastly, we have Llama 3.1 8b, where NVIDIA continues to offer more performance at the same number of accelerators, and pushes things beyond that with scale-up configurations. Whether at massive scale or modest configurations, NVIDIA consistently outperformed the competition, often delivering results that rivals couldn't even submit. With continued software optimizations and the upcoming Vera Rubin platform on the horizon, NVIDIA's leadership in AI training remains stronger than ever. Follow Wccftech on Google to get more of our news coverage in your feeds.
NVIDIA Blackwell swept every category in MLPerf Training 6.0, a peer-reviewed industry benchmark suite for AI training performance. The platform achieved the fastest time to train on all seven benchmarks and scaled to 8,192 GPUs, and NVIDIA was the only vendor to submit results for every test. GB300 systems ran up to 60% faster than GB200 systems, while competitors submitted no results for the newest workloads.
NVIDIA Blackwell swept every category in MLPerf Training 6.0, the latest in a series of rigorous, peer-reviewed industry benchmarks for evaluating AI training performance [1]. The platform delivered the fastest time to train on all seven benchmarks, scaled to 8,192 GPUs using NVIDIA Blackwell NVL72 systems, and was the only platform with submissions across the entire suite [2]. The gap to competitors was substantial: in one comparison, what NVIDIA accomplished in 4.46 minutes took the nearest alternative 58.63 minutes, a 13.1x difference in time to train [2].
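The 13.1x figure follows directly from the two reported times. A minimal check, assuming both times measure the same benchmark to the same quality target (the variable names are illustrative only):

```python
# Times to train in minutes, as reported in source [2].
nvidia_minutes = 4.46
nearest_minutes = 58.63

# Ratio of the slower time to the faster time.
ratio = nearest_minutes / nvidia_minutes
print(f"{ratio:.1f}x")  # prints "13.1x"
```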
This round introduced two new mixture-of-experts (MoE) pretraining workloads, DeepSeek-V3 671B and GPT-OSS-20B, reflecting the growing centrality of MoE architectures in frontier AI models [1]. On these newest benchmarks, competitors, including AMD with its MI300 and MI350 series, submitted no results, leaving NVIDIA as the sole participant [2].
The difference between the two Blackwell generations was also significant: GB300 NVL72 delivered up to 1.6x faster training than GB200 NVL72 at the same scale [1], which is up to 60% faster. The gain is attributed to higher compute density with NVFP4, expanded memory capacity, and a higher power ceiling that lets the GPU sustain peak performance [1][2].
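The "1.6x" and "60% faster" figures describe the same result, but neither means that training takes 60% less time. A short sketch of the conversion, assuming speedup is defined as old time divided by new time (the helper `speedup_summary` is hypothetical, not from either source):

```python
def speedup_summary(speedup):
    """Convert a speedup factor into two commonly quoted percentages."""
    # Throughput gain: a 1.6x speedup means 60% more work per unit time.
    percent_faster = (speedup - 1.0) * 100.0
    # Wall-clock saving: the same job finishes in 1/speedup of the time.
    percent_less_time = (1.0 - 1.0 / speedup) * 100.0
    return percent_faster, percent_less_time

faster, less_time = speedup_summary(1.6)
print(f"{faster:.1f}% faster, {less_time:.1f}% less time")
```

Under that definition, a 1.6x speedup corresponds to about 60% higher throughput and about 37.5% less time to train.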
Within each NVL72 rack-scale system, fifth-generation NVLink Switches connect all 72 GPUs with high bandwidth, creating a unified pool of compute and memory that lets them act as one giant GPU [1]. This matters most for large-scale MoE training, where tokens must be routed across GPUs to reach the right expert subnetwork, an all-to-all communication pattern that NVLink's bandwidth keeps fast and efficient at scale [1].
NVIDIA's partners reached notable milestones on Blackwell-based systems. Microsoft Azure scaled Llama 3.1 405B training to 8,192 GPUs using GB200 NVL72 systems and reached the reference quality target in 7.07 minutes, the fastest time to train for this benchmark [1]. CoreWeave delivered the fastest time to train for DeepSeek-V3 671B, reaching the quality target in 2.02 minutes at 8,192-GPU scale using GB300 NVL72 systems connected with Spectrum-X Ethernet networking [1].
NVIDIA's own DeepSeek-V3 671B submission scaled to 8,192 GPUs using GB200 NVL72 systems, the largest-scale Blackwell-based submission in MLPerf Training to date [1]. NVIDIA also submitted results at 5,120 GPUs with GB200 NVL72 systems on Llama 3.1 405B, one of the largest dense LLMs in the suite [1].
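To see why MoE training stresses the interconnect, the token routing described above can be sketched as a toy simulation. Everything here is an assumption for illustration: a uniform random router stands in for a learned one, tokens are sharded round-robin, and the expert count of 144 is arbitrary. It is not how any MLPerf submission is implemented.

```python
import random
from collections import Counter

def route_tokens(num_tokens, num_experts, num_gpus, seed=0):
    """Toy top-1 MoE routing: count tokens sent from each GPU to each GPU."""
    rng = random.Random(seed)
    experts_per_gpu = num_experts // num_gpus  # experts sharded evenly (assumption)
    traffic = Counter()  # (source_gpu, dest_gpu) -> number of tokens
    for token in range(num_tokens):
        src = token % num_gpus               # GPU holding the token (round-robin)
        expert = rng.randrange(num_experts)  # stand-in for a learned router
        dst = expert // experts_per_gpu      # GPU hosting the chosen expert
        traffic[(src, dst)] += 1
    return traffic

# 72 GPUs, as in one NVL72 rack; the other sizes are hypothetical.
traffic = route_tokens(num_tokens=7200, num_experts=144, num_gpus=72)
remote = sum(n for (src, dst), n in traffic.items() if src != dst)
remote_fraction = remote / 7200
print(f"{remote_fraction:.1%} of tokens leave their source GPU")
```

With experts spread evenly over 72 GPUs and uniform routing, roughly 71 of every 72 tokens must cross to another GPU, so nearly every GPU exchanges data with every other GPU on each MoE layer.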
The results show a wide gap between NVIDIA and other AI training hardware vendors. In the Flux1 benchmark, 32 GB300 GPUs were faster than 512 AMD MI300X and 64 MI320X accelerators, and no submissions were made for AMD's newer MI350 series [2]. On Llama 2 70B, NVIDIA's GB300 and GB200 8-accelerator systems outpaced the competition, and on Llama 3.1 8B, NVIDIA delivered more performance at the same number of accelerators [2].
NVIDIA continues to push low-precision training across model architectures, most recently using NVFP4 to pretrain the 550-billion-parameter NVIDIA Nemotron 3 Ultra model [1]. In this round it also showcased NVFP4 training methods that increase performance while meeting strict accuracy requirements across large- and small-scale pretraining as well as fine-tuning workloads [1].
For organizations building frontier AI models, training infrastructure shapes how fast teams can iterate, what scale of model they can build, and whether jobs complete reliably [1]. Reliability matters because production training runs can span weeks or months across hundreds of thousands of GPUs [1]. With continued software optimizations and the Vera Rubin platform expected in the coming months, NVIDIA's lead in AI training looks set to continue [2].
Summarized by Navi