2 Sources
[1]
MLPerf Introduces Largest and Smallest LLM Benchmarks
Nvidia topped MLPerf's new reasoning benchmark with its new Blackwell Ultra GPU, packaged in a GB300 rack-scale design.

The machine learning field is moving fast, and the yardsticks used to measure its progress are racing to keep up. A case in point: MLPerf, the twice-yearly machine learning competition sometimes termed "the Olympics of AI," introduced three new benchmark tests, reflecting new directions in the field.

"Lately, it has been very difficult trying to follow what happens in the field," says Miro Hodak, AMD engineer and MLPerf Inference working group co-chair. "We see that the models are becoming progressively larger, and in the last two rounds we have introduced the largest models we've ever had."

The chips that tackled these new benchmarks came from the usual suspects: Nvidia, AMD, and Intel. Nvidia topped the charts, introducing its new Blackwell Ultra GPU, packaged in a GB300 rack-scale design. AMD put up a strong performance, introducing its latest MI325X GPUs. Intel proved that one can still do inference on CPUs with its Xeon submissions, but also entered the GPU game with an Intel Arc Pro submission.

Last round, MLPerf introduced its largest benchmark yet, a large language model based on Llama3.1-405B. This round, it topped itself yet again, introducing a benchmark based on the DeepSeek R1 671B model, with more than 1.5 times as many parameters as the previous largest benchmark. As a reasoning model, DeepSeek R1 goes through several steps of chain-of-thought when approaching a query. This means much more of the computation happens during inference than in normal LLM operation, making this benchmark even more challenging. Reasoning models are claimed to be the most accurate, making them the technique of choice for science, math, and complex programming queries.

In addition to the largest LLM benchmark yet, MLPerf also introduced the smallest, based on Llama3.1-8B. There is growing industry demand for low-latency yet high-accuracy reasoning, explained Taran Iyengar, MLPerf Inference task force chair. Small LLMs can supply this and are an excellent choice for tasks such as text summarization and edge applications.

This brings the total count of LLM-based benchmarks to a confusing four. They include the new, smallest Llama3.1-8B benchmark; a pre-existing Llama2-70B benchmark; last round's Llama3.1-405B benchmark; and the largest, the new DeepSeek R1 benchmark. If nothing else, this signals LLMs are not going anywhere.

In addition to the myriad LLMs, this round of MLPerf Inference included a new voice-to-text model, based on Whisper-large-v3. This benchmark is a response to the growing number of voice-enabled applications, be they smart devices or speech-based AI interfaces.

The MLPerf Inference competition has two broad categories: "closed," which requires using the reference neural network model as-is, without modifications; and "open," where some modifications to the model are allowed. Within those, there are several subcategories related to how the tests are done and on what sort of infrastructure. We will focus on the "closed" datacenter server results for the sake of sanity.

Surprising no one, the best performance per accelerator on each benchmark, at least in the server category, was achieved by an Nvidia GPU-based system. Nvidia also unveiled the Blackwell Ultra, topping the charts in the two largest benchmarks: Llama3.1-405B and DeepSeek R1 reasoning.
Blackwell Ultra is a more powerful iteration of the Blackwell architecture, featuring significantly more memory capacity, double the acceleration for attention layers, 1.5x more AI compute, and faster memory and connectivity than standard Blackwell. It is intended for the largest AI workloads, like the two benchmarks it was tested on.

In addition to the hardware improvements, Dave Salvator, director of accelerated computing products at Nvidia, attributes the success of Blackwell Ultra to two key changes. First is the use of Nvidia's proprietary 4-bit floating-point number format, NVFP4. "We can deliver comparable accuracy to formats like BF16," Salvator says, while using much less computing power.

The second is so-called disaggregated serving. The idea behind disaggregated serving is that there are two main parts to the inference workload: prefill, where the query ("Please summarize this report.") and its entire context window (the report) are loaded into the LLM, and generation/decoding, where the output is actually calculated. These two stages have different requirements: prefill is compute-heavy, while generation/decoding is much more dependent on memory bandwidth. Salvator says that by assigning different groups of GPUs to the two stages, Nvidia achieves a performance gain of nearly 50 percent. (A minimal code sketch of this idea appears at the end of this section.)

AMD's newest accelerator chip, the MI355X, launched in July. The company offered results only in the "open" category, where software modifications to the model are permitted. Like Blackwell Ultra, the MI355X features 4-bit floating-point support, as well as expanded high-bandwidth memory. The MI355X beat its predecessor, the MI325X, on the open Llama2-70B benchmark by a factor of 2.7, says Mahesh Balasubramanian, senior director of data center GPU product marketing at AMD.

AMD's "closed" submissions included systems powered by AMD MI300X and MI325X GPUs. The more advanced MI325X systems performed similarly to those built with Nvidia H200s on the Llama2-70B, mixture-of-experts, and image-generation benchmarks. This round also included the first hybrid submission, in which both AMD MI300X and MI325X GPUs were used for the same inference task, the Llama2-70B benchmark. The use of hybrid GPUs is important because new GPUs are arriving at a yearly cadence, and older models, deployed en masse, are not going anywhere; being able to spread workloads between different kinds of GPUs is an essential step.

In the past, Intel has remained steadfast that one does not need a GPU to do machine learning. Indeed, submissions using Intel's Xeon CPUs still performed on par with the Nvidia L4 on the object-detection benchmark but trailed on the recommender-system benchmark. This round, for the first time, an Intel GPU also made a showing. The Intel Arc Pro was first released in 2022. The MLPerf submission featured a graphics card called the MaxSun Intel Arc Pro B60 Dual 48G Turbo, which contains two GPUs and 48 gigabytes of memory. The system performed on par with Nvidia's L40S on the small-LLM benchmark and trailed it on the Llama2-70B benchmark.
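To make the disaggregated-serving idea described above concrete, here is a minimal, hypothetical Python sketch. It models the two stages as separate worker pools connected by a hand-off queue; every name here, along with the fake "KV cache," is an illustrative assumption rather than Nvidia's implementation.

```python
# Disaggregated serving, in miniature: prefill and decode run on separate
# worker pools because their bottlenecks differ (compute vs. memory bandwidth).
from dataclasses import dataclass
from queue import Queue

@dataclass
class PrefillResult:
    request_id: int
    kv_cache: list  # stand-in for the attention KV cache handed to decode

def prefill_stage(requests: Queue, handoff: Queue) -> None:
    """Compute-bound stage: ingest the full prompt and context once."""
    while not requests.empty():
        req_id, prompt = requests.get()
        kv = [float(len(tok)) for tok in prompt.split()]  # fake "KV cache"
        handoff.put(PrefillResult(req_id, kv))

def decode_stage(handoff: Queue) -> None:
    """Memory-bandwidth-bound stage: stream output tokens from cached state."""
    while not handoff.empty():
        result = handoff.get()
        n_tokens = min(4, len(result.kv_cache))  # pretend token generation
        print(f"request {result.request_id}: generated {n_tokens} tokens")

requests: Queue = Queue()
requests.put((0, "Please summarize this report."))
handoff: Queue = Queue()
prefill_stage(requests, handoff)  # would run on the "prefill" GPU pool
decode_stage(handoff)             # would run on the "decode" GPU pool
```

In a real deployment, the hand-off would carry the actual attention KV cache across the interconnect, and each pool would be sized to its bottleneck: more compute for prefill, more memory bandwidth for decode.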
[2]
Nvidia claims software and hardware upgrades allow Blackwell Ultra GB300 to dominate MLPerf benchmarks -- touts 45% DeepSeek R1 inference throughput increase over GB200
Big increases in performance when running a range of popular open-source models.

Nvidia has broken its own records in MLPerf benchmarks using its latest-generation Blackwell Ultra GB300 NVL72 rack-scale system, delivering what it claims is a 45% increase in inference performance over the Blackwell-based GB200 platform in DeepSeek R1 tests. Combining hardware improvements and software optimizations, Nvidia claims the top spot when running a range of models, and suggests this should be a primary consideration for any developers building out "AI factories," as it could result in major enhancements for revenue generation.

Nvidia's Blackwell architecture is at the heart of its latest-generation RTX 50-series graphics cards, which offer the best performance for gaming, even if AMD's RX 9000-series arguably offers better bang for the buck. But it's also what's under the hood of the big AI-powering GPU stacks like its GB200 platform, which is being built into a range of data centers all over the world to power next-generation AI applications. Blackwell Ultra, in the GB300, is the enhanced version of that with even more performance, and Nvidia has now tested it to some impressive MLPerf records.

The latest version of the MLPerf benchmark includes inference performance testing using the DeepSeek R1, Llama 3.1 405B, Llama 3.1 8B, and Whisper models, and the GB300 NVL72 stole the show in all of them. Nvidia claims a 45% increase in performance over GB200 when running the DeepSeek model, and up to five times the performance of older Hopper GPUs, although Nvidia does note those comparative results came from unverified third parties.

Part of these performance enhancements comes from the more capable tensor cores used in Blackwell Ultra, with Nvidia claiming "2X the attention-layer acceleration and 1.5X more AI compute FLOPS." However, a range of important software improvements and optimizations also played a part. Nvidia utilized its NVFP4 format extensively in these benchmarks, quantizing the DeepSeek R1 weights in a way that reduces the overall model size and allows Blackwell Ultra to accelerate the calculations for higher throughput while maintaining accuracy.

For other benchmarks, like the larger Llama 3.1 405B model, Nvidia was able to "shard" the model across multiple GPUs at once, enabling higher throughput while maintaining latency standards. This was only possible because of its 1.8-TBps NVLink fabric between each of its 72 GPUs, for a total bandwidth of 130 TBps.

All of this is part of Nvidia's pitch for Blackwell Ultra as economically disruptive for "AI factory" development. Greater inference throughput from improved hardware and software optimizations makes GB300 a potentially more profitable platform in Nvidia's vision of a tokenized future for data center workloads. With shipments of GB300 set to start this month, the timing of these new benchmark results seems like no coincidence.
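As a rough illustration of the kind of block-wise 4-bit quantization a format like NVFP4 enables, here is a Python sketch. The E2M1 value grid and the per-16-element block scale follow public descriptions of NVFP4; the round-to-nearest scheme and the function names are simplifying assumptions, not Nvidia's actual kernels.

```python
# Block-wise FP4 (E2M1) quantization sketch: shrink FP32 weights to a 4-bit
# grid plus one shared scale per block, then measure the reconstruction error.
import numpy as np

# Non-negative values representable in FP4 E2M1 (2 exponent bits, 1 mantissa bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(weights):
    """Map one block of FP32 weights onto the signed FP4 grid with a shared scale."""
    scale = np.abs(weights).max() / FP4_GRID[-1]
    scale = scale if scale > 0 else 1.0
    scaled = weights / scale
    # Round each value to the nearest representable magnitude, keeping its sign.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID).argmin(axis=1)
    codes = np.sign(scaled) * FP4_GRID[idx]
    return codes, scale

def dequantize_block(codes, scale):
    return codes * scale

rng = np.random.default_rng(0)
block = rng.normal(size=16)  # NVFP4 reportedly scales per 16-value block
codes, scale = quantize_block(block)
err = np.abs(block - dequantize_block(codes, scale)).mean()
print(f"mean absolute quantization error: {err:.4f}")
```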
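The "sharding" mentioned above is, in spirit, tensor parallelism: each GPU stores a slice of a layer's weights, computes a partial result, and the pieces are gathered over the interconnect (NVLink, in the GB300's case). Below is a toy sketch under those generic assumptions, with made-up sizes; it is not Nvidia's serving stack.

```python
# Tensor-parallel sharding in miniature: split a weight matrix column-wise
# across "devices," compute partial outputs independently, then gather them.
import numpy as np

N_DEVICES = 4                       # stand-in for the 72 GPUs of an NVL72 rack
rng = np.random.default_rng(1)
x = rng.normal(size=(1, 1024))      # one token's activations
W = rng.normal(size=(1024, 4096))   # one dense layer's weight matrix

shards = np.split(W, N_DEVICES, axis=1)     # each device stores 1/N of the columns
partials = [x @ shard for shard in shards]  # computed independently on each device
y = np.concatenate(partials, axis=1)        # "all-gather" over the interconnect

assert np.allclose(y, x @ W)  # the sharded result matches the unsharded matmul
print("sharded output matches:", np.allclose(y, x @ W))
```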
Nvidia's latest Blackwell Ultra GB300 system showcases impressive performance in MLPerf benchmarks, particularly excelling in large language model inference tasks. The results highlight the rapid advancement in AI hardware and benchmarking standards.
Nvidia has once again demonstrated its dominance in the AI hardware space with its new Blackwell Ultra GPU, packaged in the GB300 rack-scale design. The latest MLPerf inference benchmarks, often referred to as the "Olympics of AI," have showcased Nvidia's impressive performance gains, particularly in large language model (LLM) inference tasks [1][2].

The MLPerf Inference competition has introduced three new benchmark tests, reflecting the rapid evolution of machine learning technologies. These include:

- A reasoning benchmark based on the DeepSeek R1 671B model, the largest LLM benchmark to date
- A small-LLM benchmark based on Llama3.1-8B, aimed at low-latency tasks such as text summarization and edge applications
- A voice-to-text benchmark based on Whisper-large-v3
These additions bring the total count of LLM-based benchmarks to four, signaling the growing importance and diversity of language models in the AI landscape [1].

Nvidia's Blackwell Ultra GPU has demonstrated significant performance improvements over its predecessors:

- A claimed 45% increase in DeepSeek R1 inference throughput over the Blackwell-based GB200 platform
- Up to five times the performance of older Hopper GPUs, per comparisons Nvidia attributes to unverified third parties [2]
Nvidia's impressive results can be attributed to both hardware improvements and software optimizations:
Hardware enhancements:

- Significantly more memory capacity than standard Blackwell
- 2x attention-layer acceleration and 1.5x more AI compute FLOPS
- Faster memory and connectivity, including a 1.8-TBps NVLink fabric across 72 GPUs

Software optimizations:

- NVFP4, a 4-bit floating-point format that shrinks model weights while maintaining accuracy
- Disaggregated serving, which assigns prefill and generation/decoding to separate GPU pools
- Sharding large models such as Llama 3.1 405B across multiple GPUs [1][2]
The performance gains demonstrated by Nvidia's Blackwell Ultra GB300 have significant implications for the development and deployment of AI systems:

- Higher inference throughput at a given latency makes "AI factories" potentially more profitable, supporting Nvidia's vision of a tokenized future for data center workloads
- Large reasoning models such as the 671B-parameter DeepSeek R1 become practical to serve at scale

As shipments of GB300 are set to start this month, these benchmark results position Nvidia as a leader in the rapidly evolving AI hardware market, potentially disrupting the economics of "AI factory" development [2].

Summarized by Navi