2 Sources
[1]
InferenceMax AI benchmark tests software stacks, efficiency, and TCO -- vendor-neutral suite runs nightly and tracks performance changes over time
In AI, much like with phones, software matters as much as, if not oftentimes more than, the hardware. News coverage surrounding artificial intelligence almost invariably focuses on the deals that send hundreds of billions of dollars flying, or on the latest hardware developments in the GPU or datacenter world. Benchmarking efforts, too, have focused almost exclusively on the silicon, and that's what SemiAnalysis intends to address with its open-source InferenceMax AI benchmarking suite. It measures the efficiency of the many components of AI software stacks in real-world inference scenarios (when AI models are actually "running" rather than being trained), and publishes the results on the InferenceMax live dashboard. InferenceMax is released under the Apache 2.0 license and measures the performance of hundreds of AI accelerator hardware and software combinations in rolling-release fashion, producing new results nightly with recent versions of the software. As the project states, existing benchmarks are run at fixed points in time and don't necessarily show what current versions are capable of; nor do they highlight the evolution (or even regression) of software advancements across an entire AI stack, with its drivers, kernels, frameworks, models, and other components.
The benchmark is designed to be as neutral as possible and to mimic real-world applications. Rather than just focusing on absolute performance, InferenceMax's metrics try to reach the magic number that projects care about: TCO (total cost of ownership), in dollars per million tokens. Simplifying somewhat, a "token" is a unit of generated AI data. The basic performance measures are tokens per second per GPU and tokens per second per user, each varying depending on how many requests are being served at any given moment. Per the old adage of "fast, big, or cheap -- pick two", high throughput (measured in tok/s/GPU), meaning optimal GPU usage, is best obtained by serving many clients at once, as LLM inference relies on matrix multiplication, which in turn benefits from batching many requests together. However, serving many requests at once reduces how much time the GPU can dedicate to any single one, so getting faster output (say, in a chatbot conversation) means increasing interactivity (measured as tok/s/user) and lowering throughput. If you've ever seen ChatGPT respond as if it had a bad stutter, you know what happens when throughput is prioritized too heavily over interactivity.
As in any Goldilocks-type scenario, there's an equilibrium between those two measures for a general-purpose setup. The ideal configurations sit on the Pareto frontier curve, a specific region of a graph plotting throughput against interactivity. Since GPUs ultimately boil down to a dollar-per-hour cost, whether rented or purchased (once price and power consumption are factored in), the best GPU for any given scenario is not necessarily the fastest one -- it's the one that's most efficient. InferenceMax remarks that high-interactivity setups are pricier than high-throughput ones, which serve more users simultaneously at a lower cost per token, although the former can be the more profitable. The one true measure for service providers, then, is the TCO, measured in dollars per million tokens. InferenceMax attempts to estimate this figure for various scenarios, including purchasing and owning GPUs versus renting them.
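To make the cost math concrete, here is a minimal sketch of how dollars per million tokens falls out of a GPU's hourly cost and its measured throughput, and how an interactivity floor constrains the choice of operating point. All figures below (hourly cost, throughput, concurrency levels) are hypothetical assumptions for illustration, not InferenceMax results.

```python
# Illustrative only: hypothetical hourly costs and throughput figures,
# not actual InferenceMax data.

def cost_per_million_tokens(gpu_cost_per_hour: float, tok_s_gpu: float) -> float:
    """Dollars per million tokens: hourly cost divided by tokens produced per hour, scaled to 1M."""
    tokens_per_hour = tok_s_gpu * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical operating points along the throughput/interactivity trade-off:
# batching more requests raises tok/s/GPU but lowers tok/s/user.
operating_points = [
    {"concurrency": 1,   "tok_s_gpu": 150,   "tok_s_user": 150},
    {"concurrency": 16,  "tok_s_gpu": 1_800, "tok_s_user": 112},
    {"concurrency": 64,  "tok_s_gpu": 4_500, "tok_s_user": 70},
    {"concurrency": 256, "tok_s_gpu": 9_000, "tok_s_user": 35},
]

GPU_COST_PER_HOUR = 3.50  # assumed all-in $/hr (rental, or amortized purchase plus power)
MIN_TOK_S_USER = 50       # assumed interactivity floor for a chatbot-style workload

viable = [p for p in operating_points if p["tok_s_user"] >= MIN_TOK_S_USER]
best = min(viable, key=lambda p: cost_per_million_tokens(GPU_COST_PER_HOUR, p["tok_s_gpu"]))
print(f"Cheapest viable point: concurrency={best['concurrency']}, "
      f"${cost_per_million_tokens(GPU_COST_PER_HOUR, best['tok_s_gpu']):.2f} per 1M tokens")
```

The shape of the calculation is the same wherever a deployment lands on the Pareto frontier; only the measured tok/s/GPU and tok/s/user values change.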
It's important to note that simply looking at performance graphs for a given GPU and its associated software stack won't give you a good picture of the best option unless all of the metrics and the intended usage scenario are taken into consideration. Beyond that, InferenceMax is meant to show how changes to the software stack, rather than to the chips, affect all the metrics above, and thus the TCO. As practical examples, InferenceMax remarks that AMD's MI355X is actually competitive with Nvidia's big B200 in TCO, even though the latter is much faster. On the other hand, AMD's FP4 (a 4-bit floating-point format) kernels appear to have room for improvement, as scenarios and models that depend on this math remain mostly the domain of Nvidia's chips.
For its 1.0 release, InferenceMax supports a mix of Nvidia's GB200 NVL72, B200, H200, and H100 accelerators, as well as AMD's Instinct MI355X, MI325X, and MI300X. The project notes that it expects to add support for Google's TPUs and AWS Trainium in the coming months. The benchmarks are run nightly via GitHub Actions runners. Both AMD and Nvidia were asked to supply real-world configuration sets for their GPUs and software stacks, as these can be tuned in thousands of different ways.
While on the topic of vendor collaboration, InferenceMax thanks many people across major vendors and multiple cloud hosting providers who worked with the project, some even fixing bugs overnight. The project also uncovered multiple bugs in both Nvidia and AMD setups, highlighting the rapid pace of development and deployment of AI acceleration stacks. The collaboration resulted in patches to AMD's ROCm (the equivalent of Nvidia's CUDA), with InferenceMax noting that AMD should focus on providing its users with better default configurations, as there are reportedly too many parameters to tune before reaching optimal performance. On the Nvidia side, the project hit some headwinds with the freshly minted Blackwell drivers, encountering snags around initialization and termination that became apparent in benchmarking scenarios that spin instances up and down in rapid succession. If you have more than a passing interest in the area, read InferenceMax's announcement and write-up; it's a fun read that details the technical challenges encountered in a humorous fashion.
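As a rough illustration of the owning-versus-renting comparison described above, the sketch below amortizes a hypothetical purchase price and power bill into an hourly cost and converts both options into dollars per million tokens. None of these numbers come from InferenceMax; they only show the shape of the calculation.

```python
# Hypothetical owning-vs-renting comparison; all figures are made-up assumptions.

HOURS_PER_YEAR = 8760

def owned_cost_per_hour(purchase_price: float, lifetime_years: float,
                        power_kw: float, price_per_kwh: float,
                        utilization: float = 0.8) -> float:
    """Amortize the purchase price over the hours the GPU is actually busy,
    then add the electricity drawn during a busy hour."""
    busy_hours = HOURS_PER_YEAR * lifetime_years * utilization
    capex_per_hour = purchase_price / busy_hours
    power_per_hour = power_kw * price_per_kwh
    return capex_per_hour + power_per_hour

def dollars_per_million_tokens(cost_per_hour: float, tok_s_gpu: float) -> float:
    return cost_per_hour / (tok_s_gpu * 3600) * 1_000_000

TOK_S_GPU = 5_000    # assumed sustained throughput at the chosen operating point
RENTAL_RATE = 4.00   # assumed $/hr from a cloud provider

owned = owned_cost_per_hour(purchase_price=35_000, lifetime_years=4,
                            power_kw=1.0, price_per_kwh=0.10)
print(f"Owned:  ${dollars_per_million_tokens(owned, TOK_S_GPU):.2f} / 1M tokens")
print(f"Rented: ${dollars_per_million_tokens(RENTAL_RATE, TOK_S_GPU):.2f} / 1M tokens")
```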
[2]
Nvidia Tops New AI Inference Benchmark | PYMNTS.com
The new InferenceMAX v1 benchmark measures how efficiently AI systems perform inference, the process of turning trained models into real-time outputs such as text, answers, or predictions. Unlike earlier tests that focused only on raw speed, it factors in responsiveness, energy use, and total cost of compute to show how much value a system can deliver for its operating cost.
At the center of the results are the Blackwell B200 GPU and the GB200 NVL72 system. The B200 is a new processor built specifically for running large AI models more efficiently. The GB200 NVL72 combines multiple B200 units into a single rack-scale machine designed for data centers that need high performance and continuous operation. Nvidia said a $5 million GB200 installation can generate up to $75 million in "token revenue," a metric that estimates how much AI-generated content or data a system can produce when deployed in applications such as chatbots, analytics, or recommendation engines. The more tokens a chip can generate for less energy and cost, the greater the potential return on investment.
The figures show how the economics of AI are changing. As models shift from single responses to multistep reasoning, compute and energy demands increase. Nvidia's architecture aims to support this growth while keeping operating costs manageable for companies deploying AI at scale.
The benchmark results arrive as rivals expand their own AI chip programs. AMD is rolling out new-generation accelerators designed for data-center AI and scientific workloads, and is partnering with cloud providers to make the chips available across shared infrastructure, offering enterprises a lower-cost alternative to Nvidia hardware. Google continues to develop its custom Tensor Processing Units, or TPUs, which power products such as Search, Gemini, and Vertex AI. The newest generation, called Ironwood, is engineered to improve efficiency when running large language models, helping Google manage computing costs and reduce its dependence on external chip suppliers. Amazon Web Services is also advancing its in-house chip strategy with Trainium2, now available through AWS. The chip is designed to lower the cost of both training and running AI models, giving businesses a more affordable path to enterprise AI adoption.
These developments show how major tech firms are trying to control more of their own AI infrastructure. By building custom chips, they can tune performance for specific workloads and reduce long-term reliance on third-party hardware. Even so, Nvidia remains ahead in performance and efficiency, which continue to be the defining measures of success in AI infrastructure.
Nvidia confirmed its benchmark results after the data was released, emphasizing that the performance gains were independently measured. The announcement follows a series of milestones for the company, including becoming the first U.S. firm to reach a record $4 trillion market capitalization and launching a GPU marketplace that allows developers and enterprises to rent computing power from partners such as CoreWeave, Crusoe, and Lambda.
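For a sense of how the "token revenue" framing works, here is a back-of-the-envelope sketch. Only the $5 million system cost and the up-to-$75 million revenue claim come from the article; the aggregate throughput and the price charged per million tokens are assumptions made up for illustration.

```python
# Back-of-the-envelope sketch of the "token revenue" framing.
# Only the $5M system cost and $75M revenue claim come from the source;
# the throughput and token price below are hypothetical assumptions.

SYSTEM_COST = 5_000_000            # reported GB200 NVL72 installation cost
CLAIMED_TOKEN_REVENUE = 75_000_000

FLEET_TOK_S = 1_000_000            # assumed aggregate tokens/second for the rack-scale system
PRICE_PER_M_TOKENS = 2.50          # assumed $ charged per million tokens served
YEARS = 1

tokens_generated = FLEET_TOK_S * 3600 * 24 * 365 * YEARS
revenue = tokens_generated / 1_000_000 * PRICE_PER_M_TOKENS
print(f"Tokens/year: {tokens_generated:.3e}")
print(f"Implied revenue: ${revenue:,.0f}  (claimed: up to ${CLAIMED_TOKEN_REVENUE:,})")
print(f"Implied return multiple: {revenue / SYSTEM_COST:.1f}x vs claimed "
      f"{CLAIMED_TOKEN_REVENUE / SYSTEM_COST:.0f}x")
```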
InferenceMax, a new AI benchmarking suite, introduces comprehensive metrics for evaluating AI inference efficiency. Nvidia's latest hardware demonstrates superior performance in the inaugural tests.
SemiAnalysis has introduced InferenceMax, an open-source AI benchmarking suite that aims to provide a more comprehensive evaluation of AI software stacks and hardware efficiency in real-world inference scenarios. Unlike traditional benchmarks that focus solely on hardware performance, InferenceMax measures the efficiency of various components within AI software stacks, offering a more holistic view of AI system capabilities [1].
InferenceMax stands out with its unique approach to benchmarking:
Real-time Updates: The suite runs nightly tests, providing up-to-date performance data on hundreds of AI accelerator hardware and software combinations [1].
Total Cost of Ownership (TCO): InferenceMax focuses on TCO, measured in dollars per million tokens, offering a practical metric for service providers [1].
Balanced Performance Metrics: The benchmark considers both throughput (tokens per second per GPU) and interactivity (tokens per second per user), striking a balance between GPU efficiency and user experience [1].
The inaugural InferenceMax v1 benchmark results have highlighted Nvidia's strong position in the AI hardware market:
Blackwell B200 GPU: Nvidia's latest processor, designed specifically for efficient large AI model execution, demonstrated exceptional performance [2].
GB200 NVL72 System: This rack-scale machine, combining multiple B200 units, showed impressive capabilities for high-performance data center operations [2].
Economic Impact: Nvidia claims that a $5 million GB200 installation can potentially generate up to $75 million in "token revenue," underscoring the significant return on investment potential for AI infrastructure [2].
The InferenceMax benchmark and Nvidia's performance have significant implications for the AI industry:
Evolving AI Economics: As AI models become more complex, requiring multistep reasoning, the demand for compute power and energy efficiency increases. The new benchmark helps quantify these evolving needs [2].
Competition in AI Hardware: While Nvidia leads the pack, competitors like AMD, Google, and Amazon are advancing their own AI chip programs, aiming to offer alternatives and reduce dependence on external suppliers [2].
Future Developments: InferenceMax plans to expand its support to include Google's TPUs and AWS Trainium in the coming months, providing a more comprehensive view of the AI hardware landscape [1].
Summarized by Navi