Curated by THEOUTPOST
On Thu, 14 Nov, 12:09 AM UTC
4 Sources
[1]
NVIDIA Blackwell AI GPUs up to 2.2x faster than Hopper in MLPerf v4.1 AI training benchmarks
NVIDIA has just published some juicy benchmarks of its new Blackwell AI GPUs in MLPerf v4.1 AI training workloads, where the new Blackwell chips are up to 2.2x faster than Hopper. Check it out:

The new Blackwell AI GPUs have set all 7 per-accelerator records using NVIDIA's Nyx AI supercomputer, which packs DGX B200 systems. The Nyx AI supercomputer is 2.2x faster than the Hopper H100 in Llama 2 70B (Fine-Tuning), 2x faster than the Hopper H100 in GPT-3 175B (Pre-Training), and it also demolished the entire set of workloads inside the MLPerf Training 4.1 suite.

NVIDIA explains: "The first Blackwell training submission to the MLCommons Consortium - which creates standardized, unbiased and rigorously peer-reviewed testing for industry participants - highlights how the architecture is advancing generative AI training performance. For instance, the architecture includes new kernels that make more efficient use of Tensor Cores. Kernels are optimized, purpose-built math operations like matrix-multiplies that are at the heart of many deep learning algorithms."
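To make the quoted kernel description concrete, here is a minimal NumPy sketch of a block-tiled matrix multiply, the kind of purpose-built math operation a GPU kernel optimizes by keeping each tile resident in fast on-chip memory. This is purely illustrative; NVIDIA's Tensor Core kernels are hardware-fused and far more sophisticated.

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 32) -> np.ndarray:
    """Block-tiled matrix multiply: compute C = A @ B one tile at a time,
    the access pattern GPU kernels use to exploit fast local memory."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # Accumulate one output tile from matching tiles of A and B.
                out[i:i+tile, j:j+tile] += (
                    a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
                )
    return out

# Sanity check against NumPy's own matmul (slicing handles ragged edges).
a = np.random.rand(64, 48).astype(np.float32)
b = np.random.rand(48, 80).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-4)
```

The tiling does not change the arithmetic, only the order in which it happens, which is exactly why kernel-level optimization can raise throughput without affecting results.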
[2]
NVIDIA Blackwell B200 GPU Achieves 2.2x Performance Increase Over Hopper in MLPerf Training Benchmarks
NVIDIA's Blackwell B200 is demonstrating significant performance improvements over its Hopper-generation predecessor. In the recent MLPerf Training benchmarks, which evaluate AI training capabilities, NVIDIA submitted initial results using the Blackwell platform in the preview category. These submissions showed consistent performance gains across all MLPerf Training benchmarks compared to Hopper-based systems. Notably, the Blackwell GPUs delivered a twofold performance increase for GPT-3 pre-training and a 2.2x boost for Llama 2 70B low-rank adaptation (LoRA) fine-tuning per GPU.

Each benchmarking system used eight Blackwell GPUs, each operating at a thermal design power of 1,000 watts, interconnected with fifth-generation NVLink technology and the latest NVLink Switch. In MLPerf Training version 4.1, the HGX B200 Blackwell platform showed up to a 2.2x performance improvement per GPU compared to the HGX H200 Hopper platform. These results, validated by MLCommons, highlight the capabilities of the Blackwell architecture in large language model training tasks. The Blackwell GPUs are equipped with HBM3e high-bandwidth memory and fifth-generation NVLink interconnects, contributing to the doubled performance in GPT-3 pre-training and the 2.2x gain in Llama 2 70B fine-tuning relative to the Hopper generation. The benchmarking systems' network infrastructure included NVIDIA ConnectX-7 SuperNICs and Quantum-2 InfiniBand switches, which provide high-speed communication between nodes for distributed training workloads.

In comparison, Hopper-based systems required 256 GPUs to optimize performance on the GPT-3 175B benchmark, whereas the Blackwell architecture achieved comparable performance with only 64 GPUs. This efficiency is attributed to the increased memory capacity and bandwidth of the HBM3e memory.
Looking ahead, NVIDIA plans to release the GB200 NVL72 system, which is expected to surpass the 2.2 times performance improvement. The upcoming system will feature expanded NVLink domains, higher memory bandwidth, and tighter integration with NVIDIA Grace CPUs, along with ConnectX-8 SuperNIC and Quantum-X800 switch technologies. Source: NVIDIA
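The emphasis on NVLink, SuperNICs, and InfiniBand switches comes down to gradient synchronization: in data-parallel training, every step ends with a collective such as a ring all-reduce, whose cost is set by link bandwidth. The sketch below models the ideal time of that collective; the bandwidth and model-size figures in the example are hypothetical, chosen only to show the scaling.

```python
def ring_allreduce_seconds(param_bytes: float, n_gpus: int,
                           link_gb_per_s: float) -> float:
    """Ideal time for a ring all-reduce of `param_bytes` of gradient data:
    each GPU sends and receives 2*(n-1)/n of the data over its link
    (a reduce-scatter phase followed by an all-gather phase)."""
    traffic = 2 * (n_gpus - 1) / n_gpus * param_bytes
    return traffic / (link_gb_per_s * 1e9)

# Hypothetical example: 175B parameters with 2-byte gradients on 8 GPUs.
# Doubling the per-GPU link bandwidth halves the synchronization time.
slow = ring_allreduce_seconds(350e9, 8, 450)
fast = ring_allreduce_seconds(350e9, 8, 900)
assert abs(slow / fast - 2.0) < 1e-9
```

This is why interconnect generations matter as much as raw FLOPS: the communication term grows with model size regardless of how fast each GPU computes.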
[3]
NVIDIA Blackwell Up To 2.2x Faster Than Hopper In MLPerf v4.1 AI Training Benchmarks, New World Records Set & Hopper Now Even Better
NVIDIA has shared the first benchmarks of its Blackwell GPUs in MLPerf v4.1 AI Training workloads, delivering a 2.2x gain over Hopper.

NVIDIA Demolishes The Competition With Blackwell GPUs, Delivering Up To A 2.2x Gain In MLPerf v4.1 AI Training Benchmarks Versus Hopper

Back in August, NVIDIA's Blackwell made its debut in the MLPerf v4.1 AI Inference benchmarks, showcasing strong performance uplifts versus the last-gen Hopper chips and also against the competition. Today, NVIDIA is sharing the first Blackwell benchmarks in the MLPerf v4.1 AI Training workloads, which show stunning results.

NVIDIA states that demand for compute in the AI segment is increasing at an exponential scale with the launch of new models, which requires both accelerated training and inference capabilities. The inference workloads were benchmarked a few months ago, and it's time to look at the training tests, which encompass the same set of workloads: some of the most popular and diverse use cases for evaluating the AI training performance of accelerators, all covered by the MLPerf Training 4.1 suite. These tests measure time-to-train (in minutes) and are backed by 125+ MLCommons members and affiliates, which keeps them aligned with the market.

Starting with Hopper, the H100 GPUs are now 1.3x faster in per-GPU LLM pre-training performance than in their first submission and deliver the highest AI training performance of any available chip on every benchmark. With Hopper, NVIDIA also made the highest at-scale MLPerf submission, using 11,616 Hopper H100 GPUs connected at data-center scale via NVLink, NVSwitch, ConnectX-7 SuperNICs, and Quantum-X400 InfiniBand switches.
Since launch, the NVIDIA Hopper GPUs have scaled up in performance thanks to continued software optimizations within the CUDA AI stack, now offering 6x the performance of HGX A100 and a 70% uplift over the June 2023 HGX H100 submission in GPT-3 (175B training), using 512 GPUs in each submission.

Rounding out its previous Hopper inference benchmarks, the chips offer 1.9x higher performance in Llama 3.1, 3x faster time-to-first-token (TTFT) with GH200 NVL32, and 1.5x faster throughput in Llama 3.1 405B, which once again shows the continued innovation in the software stack. There's a reason the competition is having a hard time competing against Hopper with their new chips, let alone Blackwell.

That brings us to Blackwell, the heart of the next-gen AI data centers. Right off the bat, NVIDIA has claimed seven per-accelerator records using its Nyx AI supercomputer, which is built from DGX B200 systems. This supercomputer offers 2.2x faster Llama 2 70B (fine-tuning) performance versus the Hopper H100, 2x faster GPT-3 175B (pre-training) performance versus the Hopper H100, and also smashes through the entire set of workloads within the MLPerf Training 4.1 suite.

With Blackwell, NVIDIA is not only doubling performance but bringing an advanced set of technologies, which we detailed in the full deep-dive provided during Hot Chips 2024. NVIDIA's partners are also showcasing outstanding performance using their Hopper-based systems, with a total of 11 partner submissions, which shows the momentum surrounding Hopper and Blackwell GPUs.

The first Blackwell training submission to the MLCommons Consortium -- which creates standardized, unbiased and rigorously peer-reviewed testing for industry participants -- highlights how the architecture is advancing generative AI training performance. For instance, the architecture includes new kernels that make more efficient use of Tensor Cores.
Kernels are optimized, purpose-built math operations like matrix-multiplies that are at the heart of many deep learning algorithms. Blackwell's higher per-GPU compute throughput and significantly larger and faster high-bandwidth memory allow it to run the GPT-3 175B benchmark on fewer GPUs while achieving excellent per-GPU performance. Taking advantage of higher-bandwidth HBM3e memory, just 64 Blackwell GPUs were run in the GPT-3 LLM benchmark without compromising per-GPU performance; the same benchmark run using Hopper needed 256 GPUs to achieve the same performance. - via NVIDIA

NVIDIA also sheds some light on its yearly cadence, which means not only building new chips as fast as possible but also validating them at data-center scale and deploying them faster at super-cluster scale. The green team makes it clear that it isn't just a company that makes chips; it is a data-center solution and system provider at scale. This is why the company has already shared its next-gen AI roadmap featuring Blackwell Ultra as the follow-up to Blackwell, with more memory (288 GB HBM3e) and more compute horsepower, in 2025. The Blackwell Ultra platform is expected to use the B300 naming convention. The follow-up to that is Rubin, which comes in the standard flavor with 8-stack (8S) HBM4 in 2026 and a 12-stack (12S) HBM4 variant in 2027.

Lastly, NVIDIA confirms that Blackwell is now in full mass production, so expect it to result in record-smashing revenue and performance figures in the coming quarters.
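The Llama 2 70B benchmark cited throughout uses LoRA (low-rank adaptation) fine-tuning: the base weight matrix W stays frozen while two small low-rank factors are trained. The NumPy sketch below shows why this is so much cheaper than full fine-tuning; the layer dimensions and rank are hypothetical, not the benchmark's actual configuration.

```python
import numpy as np

def lora_forward(x, w, a_mat, b_mat, alpha=16.0):
    """LoRA layer: y = x @ (W + (alpha/r) * B @ A).T, with the base weight W
    frozen and only the low-rank factors A (r x d_in), B (d_out x r) trained."""
    r = a_mat.shape[0]
    return x @ w.T + (alpha / r) * (x @ a_mat.T) @ b_mat.T

d_in, d_out, r = 4096, 4096, 8             # hypothetical layer size and rank
full_params = d_in * d_out                  # trainable weights, full fine-tune
lora_params = r * (d_in + d_out)            # trainable weights with LoRA
assert lora_params / full_params < 0.004    # well under 1% of the layer

# With B initialized to zero (the usual LoRA convention), the adapted layer
# starts out exactly equal to the frozen base layer.
x = np.random.randn(2, d_in).astype(np.float32)
w = np.random.randn(d_out, d_in).astype(np.float32)
a_mat = np.random.randn(r, d_in).astype(np.float32)
b_mat = np.zeros((d_out, r), dtype=np.float32)
assert np.allclose(lora_forward(x, w, a_mat, b_mat), x @ w.T)
```

Because only the small factors receive gradients, LoRA fine-tuning stresses per-GPU throughput and memory bandwidth more than optimizer-state capacity, which is one reason it serves as a distinct MLPerf workload from full pre-training.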
[4]
Peak Training: Blackwell Delivers Next-Level MLPerf Training Performance
Completing every MLPerf test, Blackwell ups the ante on training performance for AI-powered applications.

Generative AI applications that use text, computer code, protein chains, summaries, video and even 3D graphics require data-center-scale accelerated computing to efficiently train the large language models (LLMs) that power them.

In MLPerf Training 4.1 industry benchmarks, the NVIDIA Blackwell platform delivered impressive results on workloads across all tests -- and up to 2.2x more performance per GPU on LLM benchmarks, including Llama 2 70B fine-tuning and GPT-3 175B pretraining. In addition, NVIDIA's submissions on the NVIDIA Hopper platform continued to hold at-scale records on all benchmarks, including a submission with 11,616 Hopper GPUs on the GPT-3 175B benchmark.

Leaps and Bounds With Blackwell

The first Blackwell training submission to the MLCommons Consortium -- which creates standardized, unbiased and rigorously peer-reviewed testing for industry participants -- highlights how the architecture is advancing generative AI training performance. For instance, the architecture includes new kernels that make more efficient use of Tensor Cores. Kernels are optimized, purpose-built math operations like matrix-multiplies that are at the heart of many deep learning algorithms.

Blackwell's higher per-GPU compute throughput and significantly larger and faster high-bandwidth memory allow it to run the GPT-3 175B benchmark on fewer GPUs while achieving excellent per-GPU performance. Taking advantage of larger, higher-bandwidth HBM3e memory, just 64 Blackwell GPUs were able to run the GPT-3 LLM benchmark without compromising per-GPU performance. The same benchmark run using Hopper needed 256 GPUs.

The Blackwell training results follow an earlier submission to MLPerf Inference 4.1, where Blackwell delivered up to 4x more LLM inference performance versus the Hopper generation.
Taking advantage of the Blackwell architecture's FP4 precision, along with the NVIDIA QUASAR Quantization System, the submission revealed powerful performance while meeting the benchmark's accuracy requirements.

Relentless Optimization

NVIDIA platforms undergo continuous software development, racking up performance and feature improvements in training and inference for a wide variety of frameworks, models and applications.

In this round of MLPerf training submissions, Hopper delivered a 1.3x improvement on GPT-3 175B per-GPU training performance since the introduction of the benchmark. NVIDIA also submitted large-scale results on the GPT-3 175B benchmark using 11,616 Hopper GPUs connected with NVIDIA NVLink and NVSwitch high-bandwidth GPU-to-GPU communication and NVIDIA Quantum-2 InfiniBand networking. NVIDIA Hopper GPUs have more than tripled scale and performance on the GPT-3 175B benchmark since last year. In addition, on the Llama 2 70B LoRA fine-tuning benchmark, NVIDIA increased performance by 26% using the same number of Hopper GPUs, reflecting continued software enhancements.

NVIDIA's ongoing work on optimizing its accelerated computing platforms enables continued improvements in MLPerf test results -- driving performance up in containerized software, bringing more powerful computing to partners and customers on existing platforms and delivering more return on their platform investment.

Partnering Up

NVIDIA partners, including system makers and cloud service providers like ASUSTek, Azure, Cisco, Dell, Fujitsu, Giga Computing, Lambda Labs, Lenovo, Oracle Cloud, Quanta Cloud Technology and Supermicro, also submitted impressive results to MLPerf in this latest round. A founding member of MLCommons, NVIDIA sees the role of industry-standard benchmarks and benchmarking best practices in AI computing as vital.
With access to peer-reviewed, streamlined comparisons of AI and HPC platforms, companies can keep pace with the latest AI computing innovations and access crucial data that can help guide important platform investment decisions.
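The FP4 precision mentioned above packs each value into 4 bits, trading precision for throughput and memory savings while a quantization system keeps accuracy within bounds. The sketch below is a deliberately simplified symmetric integer quantizer to illustrate the round-trip; real FP4 training (e.g. with NVIDIA's QUASAR Quantization System) uses a 4-bit floating-point format and finer-grained scaling.

```python
import numpy as np

def quant4_roundtrip(x: np.ndarray):
    """Simplified symmetric 4-bit quantization: map values to integer codes
    in [-8, 7] using a single per-tensor scale, then dequantize. Illustrative
    only -- not NVIDIA's actual FP4 format."""
    scale = np.abs(x).max() / 7.0
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)  # 4-bit codes
    return codes, codes.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)
codes, xhat = quant4_roundtrip(x)
assert codes.min() >= -8 and codes.max() <= 7
# Round-trip error is bounded by half a quantization step.
assert np.abs(x - xhat).max() <= np.abs(x).max() / 7.0 / 2 + 1e-6
```

Each value now fits in 4 bits instead of 32, which is the memory and bandwidth saving that makes low-precision formats attractive; the engineering challenge the benchmark accuracy targets enforce is keeping that rounding error from degrading the trained model.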
NVIDIA's new Blackwell AI GPUs have set new performance records in MLPerf v4.1 AI training benchmarks, showing up to 2.2x faster performance compared to their predecessor, the Hopper GPUs. This significant leap in AI training capabilities has implications for various AI applications, including large language models.
NVIDIA has released the first benchmarks of its new Blackwell GPUs in MLPerf v4.1 AI Training workloads, showcasing remarkable performance improvements over its predecessor, the Hopper architecture. The results demonstrate up to a 2.2x performance gain in critical AI training tasks [1][2][3].
The Blackwell GPUs, tested using NVIDIA's Nyx AI supercomputer with DGX B200 systems, set new records across all seven per-accelerator benchmarks in the MLPerf Training 4.1 suite [1]. Key highlights include:

- 2.2x faster Llama 2 70B LoRA fine-tuning performance per GPU versus the Hopper H100
- 2x faster GPT-3 175B pre-training performance per GPU versus the Hopper H100
The Blackwell architecture introduces several improvements that contribute to its enhanced performance:

- New kernels that make more efficient use of Tensor Cores
- Larger, higher-bandwidth HBM3e memory
- Fifth-generation NVLink interconnects paired with the latest NVLink Switch
These advancements allow Blackwell to achieve comparable performance with fewer GPUs. For instance, the GPT-3 175B benchmark that required 256 Hopper GPUs can now be run on just 64 Blackwell GPUs without compromising per-GPU performance [3][4].
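The 256-to-64 reduction implies a simple per-GPU arithmetic, shown in the sketch below. Note the caveat in the code: treating matched aggregate results as a direct per-GPU ratio assumes comparable scaling efficiency at both cluster sizes, which is an assumption, and the 4x figure folds in the HBM3e capacity advantage on top of the ~2x raw per-GPU speedup the headline numbers report.

```python
def implied_per_gpu_ratio(gpus_old: int, gpus_new: int) -> float:
    """If a new system matches the aggregate result of an old one using
    fewer GPUs, each new GPU is doing gpus_old / gpus_new times the work --
    assuming comparable scaling efficiency, which real clusters only
    approximate."""
    return gpus_old / gpus_new

# 256 Hopper GPUs vs. 64 Blackwell GPUs on the GPT-3 175B benchmark.
assert implied_per_gpu_ratio(256, 64) == 4.0
```

The gap between this implied 4x and the measured 2x-2.2x per-GPU figures is a useful reminder that larger memory lets each GPU hold more of the model, so fewer GPUs can run the workload at all, independent of raw compute throughput.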
The significant performance boost offered by Blackwell GPUs has far-reaching implications for AI training, particularly for large language models and generative AI applications. The improved efficiency in training times and resource utilization could accelerate the development and deployment of more advanced AI models across various industries [4].
NVIDIA emphasizes that their platforms undergo continuous software development, leading to ongoing performance improvements. For example, since their introduction, Hopper H100 GPUs have achieved a 1.3x improvement in LLM pre-training performance per GPU [4].
NVIDIA's partners, including major cloud service providers and system makers, have also submitted impressive results to MLPerf using NVIDIA's technology. This widespread adoption underscores the impact of NVIDIA's innovations on the AI computing landscape [4].
Looking ahead, NVIDIA has already shared its next-gen AI roadmap, featuring Blackwell Ultra with 288 GB HBM3e memory in 2025, followed by the Rubin architecture in 2026 and 2027. With Blackwell now in full mass production, industry observers anticipate record-breaking revenue and performance figures in the coming quarters [3].
As AI continues to evolve and demand for compute power grows exponentially, NVIDIA's advancements in GPU technology play a crucial role in shaping the future of AI training and inference capabilities across various sectors.
The Outpost is a comprehensive collection of curated artificial intelligence software tools that cater to the needs of small business owners, bloggers, artists, musicians, entrepreneurs, marketers, writers, and researchers.
© 2025 TheOutpost.AI All rights reserved