6 Sources
[1]
New Data Shows NVIDIA Blackwell Ultra Delivers up to 50x Better Performance and 35x Lower Costs for Agentic AI
Cloud providers including Microsoft, CoreWeave and Oracle Cloud Infrastructure are deploying NVIDIA GB300 NVL72 systems at scale for low-latency and long-context use cases such as agentic coding and coding assistants. The NVIDIA Blackwell platform has been widely adopted by leading inference providers such as Baseten, DeepInfra, Fireworks AI and Together AI to reduce cost per token by up to 10x. Now, the NVIDIA Blackwell Ultra platform is taking this momentum further for agentic AI.

AI agents and coding assistants are driving explosive growth in software-programming-related AI queries: from 11% to about 50% last year, according to OpenRouter's State of Inference report. These applications require low latency to maintain real-time responsiveness across multistep workflows, and long context when reasoning across entire codebases. New performance data shows that the combination of NVIDIA's software optimizations and the next-generation NVIDIA Blackwell Ultra platform has delivered breakthrough advances on both fronts.

NVIDIA GB300 NVL72 systems now deliver up to 50x higher throughput per megawatt, resulting in 35x lower cost per token compared with the NVIDIA Hopper platform. By innovating across chips, system architecture and software, NVIDIA's extreme codesign accelerates performance across AI workloads -- from agentic coding to interactive coding assistants -- while driving down costs at scale.

GB300 NVL72 Delivers up to 50x Better Performance for Low-Latency Workloads

Recent analysis from Signal65 shows that NVIDIA GB200 NVL72, with extreme hardware and software codesign, delivers more than 10x more tokens per watt, resulting in one-tenth the cost per token compared with the NVIDIA Hopper platform. These massive performance gains continue to expand as the underlying stack improves.
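The link between throughput per megawatt and cost per token is simple arithmetic: at a fixed energy price, producing more tokens from the same megawatt makes each token cheaper. The sketch below illustrates this with invented numbers (energy is only one cost component in practice, which is why the reported cost reduction of 35x need not match the 50x throughput gain exactly):

```python
# Back-of-the-envelope sketch of why throughput per megawatt drives cost per
# token. All numbers here are illustrative placeholders, not NVIDIA figures.

def cost_per_million_tokens(tokens_per_sec_per_mw: float,
                            dollars_per_mwh: float) -> float:
    """Energy cost to produce one million tokens at a given rack efficiency."""
    tokens_per_hour_per_mw = tokens_per_sec_per_mw * 3600
    mwh_per_million_tokens = 1e6 / tokens_per_hour_per_mw  # MWh consumed
    return mwh_per_million_tokens * dollars_per_mwh

baseline = cost_per_million_tokens(tokens_per_sec_per_mw=1_000,
                                   dollars_per_mwh=100.0)
improved = cost_per_million_tokens(tokens_per_sec_per_mw=50_000,
                                   dollars_per_mwh=100.0)

print(f"baseline rack:       ${baseline:.2f} per 1M tokens")
print(f"50x-throughput rack: ${improved:.2f} per 1M tokens")
print(f"energy cost ratio: {baseline / improved:.0f}x lower")
```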
Continuous optimizations from the NVIDIA TensorRT-LLM, NVIDIA Dynamo, Mooncake and SGLang teams continue to significantly boost Blackwell NVL72 throughput for mixture-of-experts (MoE) inference across all latency targets. For instance, NVIDIA TensorRT-LLM library improvements have delivered up to 5x better performance on GB200 for low-latency workloads compared with just four months ago.

* Higher-performance GPU kernels optimized for efficiency and low latency help make the most of Blackwell's immense compute capabilities and boost throughput.
* NVIDIA NVLink Symmetric Memory enables direct GPU-to-GPU memory access for more efficient communication.
* Programmatic dependent launch minimizes idle time by launching the next kernel's setup phase before the previous one completes.

Building on these software advances, GB300 NVL72 -- which features the Blackwell Ultra GPU -- pushes the throughput-per-megawatt frontier to 50x compared with the Hopper platform. This performance gain translates into superior economics, with NVIDIA GB300 lowering costs compared with the Hopper platform across the entire latency spectrum. The most dramatic reduction occurs at low latency, where agentic applications operate: up to 35x lower cost per million tokens compared with the Hopper platform.

For agentic coding and interactive assistant workloads, where every millisecond compounds across multistep workflows, this combination of relentless software optimization and next-generation hardware enables AI platforms to scale real-time interactive experiences to significantly more users.

GB300 NVL72 Delivers Superior Economics for Long-Context Workloads

While both GB200 NVL72 and GB300 NVL72 efficiently deliver ultralow latency, the distinct advantages of GB300 NVL72 become most apparent in long-context scenarios.
For workloads with 128,000-token inputs and 8,000-token outputs -- such as AI coding assistants reasoning across codebases -- GB300 NVL72 delivers up to 1.5x lower cost per token compared with GB200 NVL72. Context grows as the agent reads in more of the code: this allows it to better understand the codebase, but it also requires much more compute. Blackwell Ultra has 1.5x higher NVFP4 compute performance and 2x faster attention processing, enabling the agent to efficiently understand entire codebases.

Infrastructure for Agentic AI

Leading cloud providers and AI innovators have already deployed NVIDIA GB200 NVL72 at scale, and are also deploying GB300 NVL72 in production. Microsoft, CoreWeave and OCI are deploying GB300 NVL72 for low-latency and long-context use cases such as agentic coding and coding assistants. By reducing token costs, GB300 NVL72 enables a new class of applications that can reason across massive codebases in real time.

"As inference moves to the center of AI production, long-context performance and token efficiency become critical," said Chen Goldberg, senior vice president of engineering at CoreWeave. "Grace Blackwell NVL72 addresses that challenge directly, and CoreWeave's AI cloud, including CKS and SUNK, is designed to translate GB300 systems' gains, building on the success of GB200, into predictable performance and cost efficiency. The result is better token economics and more usable inference for customers running workloads at scale."

NVIDIA Vera Rubin NVL72 to Bring Next-Generation Performance

With NVIDIA Blackwell systems deployed at scale, continuous software optimizations will keep unlocking additional performance and cost improvements across the installed base. Looking ahead, the NVIDIA Rubin platform -- which combines six new chips to create one AI supercomputer -- is set to deliver another round of massive performance leaps.
For MoE inference, it delivers up to 10x higher throughput per megawatt compared with Blackwell, translating into one-tenth the cost per million tokens. And for the next wave of frontier AI models, Rubin can train large MoE models using just one-fourth the number of GPUs compared with Blackwell. Learn more about the NVIDIA Rubin platform and the Vera Rubin NVL72 system.
[2]
Leading Inference Providers Cut AI Costs by up to 10x With Open Source Models on NVIDIA Blackwell
Baseten, DeepInfra, Fireworks AI and Together AI are reducing cost per token across industries with optimized inference stacks running on the NVIDIA Blackwell platform.

A diagnostic insight in healthcare. A character's dialogue in an interactive game. An autonomous resolution from a customer service agent. Each of these AI-powered interactions is built on the same unit of intelligence: a token. Scaling these AI interactions requires businesses to consider whether they can afford more tokens. The answer lies in better tokenomics -- which at its core is about driving down the cost of each token. This downward trend is unfolding across industries. Recent MIT research found that infrastructure and algorithmic efficiencies are reducing inference costs for frontier-level performance by up to 10x annually.

To understand how infrastructure efficiency improves tokenomics, consider the analogy of a high-speed printing press. If the press produces 10x output with incremental investment in ink, energy and the machine itself, the cost to print each individual page drops. In the same way, investments in AI infrastructure can lead to far greater token output compared with the increase in cost -- causing a meaningful reduction in the cost per token.

That's why leading inference providers including Baseten, DeepInfra, Fireworks AI and Together AI are using the NVIDIA Blackwell platform, which helps them reduce cost per token by up to 10x compared with the NVIDIA Hopper platform. These providers host advanced open source models, which have now reached frontier-level intelligence. By combining open source frontier intelligence, the extreme hardware-software codesign of NVIDIA Blackwell and their own optimized inference stacks, these providers are enabling dramatic token cost reductions for businesses across every industry.
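The printing-press analogy reduces to a one-line formula: cost per token is total spend divided by tokens produced. A quick sketch with invented figures shows how output growing faster than cost drives per-token cost down:

```python
# The printing-press analogy above, in numbers. All figures are invented;
# the point is that when output grows faster than cost, per-unit cost falls.

def cost_per_token(total_cost_dollars: float, tokens: int) -> float:
    return total_cost_dollars / tokens

baseline = cost_per_token(100.0, 1_000_000)   # old press: $100 for 1M tokens
upgraded = cost_per_token(130.0, 10_000_000)  # 10x output for 1.3x the spend

print(f"baseline: ${baseline * 1e6:.2f} per 1M tokens")
print(f"upgraded: ${upgraded * 1e6:.2f} per 1M tokens")
print(f"improvement: {baseline / upgraded:.1f}x cheaper per token")
```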
Healthcare -- Baseten and Sully.ai Cut AI Inference Costs by 10x

In healthcare, tedious, time-consuming tasks like medical coding, documentation and managing insurance forms cut into the time doctors can spend with patients. Sully.ai helps solve this problem by developing "AI employees" that can handle routine tasks like medical coding and note-taking. As the company's platform scaled, its proprietary, closed source models created three bottlenecks: unpredictable latency in real-time clinical workflows, inference costs that scaled faster than revenue and insufficient control over model quality and updates.

To overcome these bottlenecks, Sully.ai uses Baseten's Model API, which deploys open source models such as gpt-oss-120b on NVIDIA Blackwell GPUs. Baseten used the low-precision NVFP4 data format, the NVIDIA TensorRT-LLM library and the NVIDIA Dynamo inference framework to deliver optimized inference. The company chose NVIDIA Blackwell to run its Model API after seeing up to 2.5x better throughput per dollar compared with the NVIDIA Hopper platform.

As a result, Sully.ai's inference costs dropped by 90%, representing a 10x reduction compared with the prior closed source implementation, while response times improved by 65% for critical workflows like generating medical notes. The company has now returned over 30 million minutes to physicians, time previously lost to data entry and other manual tasks.

Gaming -- DeepInfra and Latitude Reduce Cost per Token by 4x

Latitude is building the future of AI-native gaming with its AI Dungeon adventure-story game and upcoming AI-powered role-playing gaming platform, Voyage, where players can create or play worlds with the freedom to choose any action and make their own story. The company's platform uses large language models to respond to players' actions -- but this comes with scaling challenges, as every player action triggers an inference request.
Costs scale with engagement, and response times must stay fast enough to keep the experience seamless. Latitude runs large open source models on DeepInfra's inference platform, powered by NVIDIA Blackwell GPUs and TensorRT-LLM. For a large-scale mixture-of-experts (MoE) model, DeepInfra reduced the cost per million tokens from 20 cents on the NVIDIA Hopper platform to 10 cents on Blackwell. Moving to Blackwell's native low-precision NVFP4 format further cut that cost to just 5 cents -- for a total 4x improvement in cost per token -- while maintaining the accuracy that customers expect.

Running these large-scale MoE models on DeepInfra's Blackwell-powered platform allows Latitude to deliver fast, reliable responses cost effectively. DeepInfra's inference platform delivers this performance while reliably handling traffic spikes, letting Latitude deploy more capable models without compromising player experience.

Agentic Chat -- Fireworks AI and Sentient Foundation Lower AI Costs by up to 50%

Sentient Labs is focused on bringing AI developers together to build powerful reasoning AI systems that are all open source. The goal is to accelerate AI toward solving harder reasoning problems through research in secure autonomy, agentic architecture and continual learning. Its first app, Sentient Chat, orchestrates complex multi-agent workflows and integrates more than a dozen specialized AI agents from the community. As a result, Sentient Chat has massive compute demands: a single user query can trigger a cascade of autonomous interactions that typically leads to costly infrastructure overhead.

To manage this scale and complexity, Sentient uses Fireworks AI's inference platform running on NVIDIA Blackwell. With Fireworks' Blackwell-optimized inference stack, Sentient achieved 25-50% better cost efficiency compared with its previous Hopper-based deployment. This higher throughput per GPU allowed the company to serve significantly more concurrent users for the same cost.
The platform's scalability supported a viral launch of 1.8 million waitlisted users in 24 hours and processed 5.6 million queries in a single week while delivering consistent low latency.

Customer Service -- Together AI and Decagon Drive Down Cost by 6x

Customer service calls with voice AI often end in frustration because even a slight delay can lead users to talk over the agent, hang up or lose trust. Decagon builds AI agents for enterprise customer support, with AI-powered voice being its most demanding channel. Decagon needed infrastructure that could deliver sub-second responses under unpredictable traffic loads, with tokenomics that supported 24/7 voice deployments.

Together AI runs production inference for Decagon's multimodel voice stack on NVIDIA Blackwell GPUs. The companies collaborated on several key optimizations: speculative decoding that trains smaller models to generate faster responses while a larger model verifies accuracy in the background, caching repeated conversation elements to speed up responses and building automatic scaling that handles traffic surges without degrading performance.

Decagon saw response times under 400 milliseconds even when processing thousands of tokens per query. Cost per query, which is the total cost to complete one voice interaction, dropped by 6x compared with using closed source proprietary models. This was achieved through the combination of Decagon's multimodel approach (some open source, some trained in house on NVIDIA GPUs), NVIDIA Blackwell's extreme codesign and Together's optimized inference stack.

Optimizing Tokenomics With Extreme Codesign

The dramatic cost savings seen across healthcare, gaming and customer service are driven by the efficiency of NVIDIA Blackwell. The NVIDIA GB200 NVL72 system further scales this impact by delivering a breakthrough 10x reduction in cost per token for reasoning MoE models compared with NVIDIA Hopper.
NVIDIA's extreme codesign across every layer of the stack -- spanning compute, networking and software -- and its partner ecosystem are unlocking massive reductions in cost per token at scale. This momentum continues with the NVIDIA Rubin platform -- integrating six new chips into a single AI supercomputer to deliver 10x performance and 10x lower token cost over Blackwell. Explore NVIDIA's full-stack inference platform to learn more about how it delivers better tokenomics for AI inference.
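The speculative decoding technique described for Decagon's voice stack can be sketched in a few lines. A small, fast "draft" model proposes several tokens and a larger "target" model checks them, keeping the agreed prefix. Both models below are deterministic stand-ins, and the target is consulted once per token for clarity; real systems verify the whole draft in one batched forward pass, which is where the speedup comes from:

```python
# Toy sketch of speculative decoding with stand-in "models". Neither function
# is a real LLM; they only make the accept/reject mechanics visible.

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]

def draft_next(prefix, k=4):
    # Cheap guesser: proposes the next k tokens from position alone.
    return [VOCAB[(len(prefix) + i) % len(VOCAB)] for i in range(k)]

def target_next(prefix):
    # Authoritative model: the single token it would emit next.
    return VOCAB[len(prefix) % len(VOCAB)]

def speculative_step(prefix, k=4):
    """Extend prefix using draft proposals verified by the target model."""
    accepted = list(prefix)
    for tok in draft_next(prefix, k):
        correct = target_next(accepted)
        accepted.append(correct)
        if correct != tok:  # first disagreement ends the speculation round
            break
    return accepted

print(speculative_step([]))  # -> ['the', 'cat', 'sat', 'on']
```

When the draft agrees with the target, several tokens land for roughly the price of one verification pass; when it disagrees, the output is still exactly what the target model would have produced on its own.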
[3]
AI inference costs dropped up to 10x on Nvidia's Blackwell -- but hardware is only half the equation
Lowering the cost of inference is typically a combination of hardware and software. A new analysis released Thursday by Nvidia details how four leading inference providers are reporting 4x to 10x reductions in cost per token. The dramatic cost reductions were achieved using Nvidia's Blackwell platform with open-source models. Production deployment data from Baseten, DeepInfra, Fireworks AI and Together AI shows significant cost improvements across healthcare, gaming, agentic chat and customer service as enterprises scale AI from pilot projects to millions of users.

The 4x to 10x cost reductions reported by inference providers required combining Blackwell hardware with two other elements: optimized software stacks and switching from proprietary to open-source models that now match frontier-level intelligence. Hardware improvements alone delivered 2x gains in some deployments, according to the analysis. Reaching larger cost reductions required adopting low-precision formats like NVFP4 and moving away from closed source APIs that charge premium rates.

The economics are counterintuitive: reducing inference costs requires investing in higher-performance infrastructure, because throughput improvements translate directly into lower per-token costs. "Performance is what drives down the cost of inference," Dion Harris, senior director of HPC and AI hyperscaler solutions at Nvidia, told VentureBeat in an exclusive interview. "What we're seeing in inference is that throughput literally translates into real dollar value and driving down the cost."

Production deployments show 4x to 10x cost reductions

Nvidia detailed four customer deployments in a blog post showing how the combination of Blackwell infrastructure, optimized software stacks and open-source models delivers cost reductions across different industry workloads. The case studies span high-volume applications where inference economics directly determines business viability.
Sully.ai cut healthcare AI inference costs by 90% (a 10x reduction) while improving response times 65% by switching from proprietary models to open-source models running on Baseten's Blackwell-powered platform, according to Nvidia. The company returned over 30 million minutes to physicians by automating medical coding and note-taking tasks that previously required manual data entry.

Nvidia also reported that Latitude reduced gaming inference costs 4x for its AI Dungeon platform by running large mixture-of-experts (MoE) models on DeepInfra's Blackwell deployment. Cost per million tokens dropped from 20 cents on Nvidia's previous Hopper platform to 10 cents on Blackwell, then to 5 cents after adopting Blackwell's native NVFP4 low-precision format. Hardware alone delivered 2x improvement, but reaching 4x required the precision format change.

Sentient Foundation achieved 25% to 50% better cost efficiency for its agentic chat platform using Fireworks AI's Blackwell-optimized inference stack, according to Nvidia. The platform orchestrates complex multi-agent workflows and processed 5.6 million queries in a single week during its viral launch while maintaining low latency.

Nvidia said Decagon saw 6x cost reduction per query for AI-powered voice customer support by running its multimodel stack on Together AI's Blackwell infrastructure. Response times stayed under 400 milliseconds, even when processing thousands of tokens per query -- critical for voice interactions where delays cause users to hang up or lose trust.

Technical factors driving 4x versus 10x improvements

The range from 4x to 10x cost reductions across deployments reflects different combinations of technical optimizations rather than just hardware differences. Three factors emerge as primary drivers: precision format adoption, model architecture choices and software stack integration. Precision formats show the clearest impact, and Latitude's case demonstrates this directly.
Moving from Hopper to Blackwell delivered 2x cost reduction through hardware improvements. Adopting NVFP4, Blackwell's native low-precision format, doubled that improvement to 4x total. NVFP4 reduces the number of bits required to represent model weights and activations, allowing more computation per GPU cycle while maintaining accuracy. The format works particularly well for MoE models, where only a subset of the model activates for each inference request.

Model architecture matters. MoE models, which activate different specialized sub-models based on input, benefit from Blackwell's NVLink fabric, which enables rapid communication between experts. "Having those experts communicate across that NVLink fabric allows you to reason very quickly," Harris said. Dense models that activate all parameters for every inference don't leverage this architecture as effectively.

Software stack integration creates additional performance deltas. Harris said that Nvidia's co-design approach -- where Blackwell hardware, NVL72 scale-up architecture, and software like Dynamo and TensorRT-LLM are optimized together -- also makes a difference. Baseten's deployment for Sully.ai used this integrated stack, combining NVFP4, TensorRT-LLM and Dynamo to achieve the 10x cost reduction. Providers running alternative frameworks like vLLM may see lower gains.

Workload characteristics matter. Reasoning models show particular advantages on Blackwell because they generate significantly more tokens to reach better answers. The platform's ability to process these extended token sequences efficiently through disaggregated serving, where context prefill and token generation are handled separately, makes reasoning workloads cost-effective.

Teams evaluating potential cost reductions should examine their workload profiles against these factors. High token generation workloads using mixture-of-experts models with the integrated Blackwell software stack will approach the 10x range.
Lower token volumes using dense models on alternative frameworks will land closer to 4x.

What teams should test before migrating

While these case studies focus on Nvidia Blackwell deployments, enterprises have multiple paths to reducing inference costs. AMD's MI300 series, Google TPUs, and specialized inference accelerators from Groq and Cerebras offer alternative architectures. Cloud providers also continue optimizing their inference services. The question isn't whether Blackwell is the only option but whether the specific combination of hardware, software and models fits particular workload requirements.

Enterprises considering Blackwell-based inference should start by calculating whether their workloads justify infrastructure changes. "Enterprises need to work back from their workloads and use case and cost constraints," Shruti Koparkar, AI product marketing at Nvidia, told VentureBeat. The deployments achieving 6x to 10x improvements all involved high-volume, latency-sensitive applications processing millions of requests monthly. Teams running lower volumes or applications with latency budgets exceeding one second should explore software optimization or model switching before considering infrastructure upgrades.

Testing matters more than provider specifications. Koparkar emphasized that providers publish throughput and latency metrics, but these represent ideal conditions. "If it's a highly latency-sensitive workload, they might want to test a couple of providers and see who meets the minimum they need while keeping the cost down," she said. Teams should run actual production workloads across multiple Blackwell providers to measure real performance under their specific usage patterns and traffic spikes rather than relying on published benchmarks.

The staged approach Latitude used provides a model for evaluation. The company first moved to Blackwell hardware and measured 2x improvement, then adopted NVFP4 format to reach 4x total reduction.
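The NVFP4 step in that staged approach rests on block-scaled 4-bit quantization. Below is a toy sketch in that spirit; real NVFP4 packs E2M1 values with hardware FP8 scale factors per 16-element block, while this pure-Python version only illustrates the idea of a shared scale per block with each value snapped to the nearest representable 4-bit float:

```python
# Toy sketch of block-scaled 4-bit quantization in the spirit of NVFP4.
# Not the real format: just the core mechanism, in plain Python floats.

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative E2M1 magnitudes

def quantize_block(block, grid=E2M1):
    """Quantize one block of weights: shared scale, 4-bit value per weight."""
    scale = max(abs(x) for x in block) / grid[-1] or 1.0
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable grid point.
        mag = min(grid, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale * (1 if x >= 0 else -1))
    return out

weights = [0.91, -0.42, 0.07, 0.66, -1.20, 0.33, 0.05, -0.88]
print(quantize_block(weights))
```

Each stored value needs only 4 bits plus a shared per-block scale, which is why the format cuts memory traffic and lets the GPU do more arithmetic per cycle at a small, controlled loss of precision.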
Teams currently on Hopper or other infrastructure can test whether precision format changes and software optimization on existing hardware capture meaningful savings before committing to full infrastructure migrations. Running open source models on current infrastructure might deliver half the potential cost reduction without new hardware investments.

Provider selection requires understanding software stack differences. While multiple providers offer Blackwell infrastructure, their software implementations vary. Some run Nvidia's integrated stack using Dynamo and TensorRT-LLM, while others use frameworks like vLLM. Harris acknowledged that performance deltas exist between these configurations. Teams should evaluate what each provider actually runs and how it matches their workload requirements rather than assuming all Blackwell deployments perform identically.

The economic equation extends beyond cost per token. Specialized inference providers like Baseten, DeepInfra, Fireworks and Together offer optimized deployments but require managing additional vendor relationships. Managed services from AWS, Azure or Google Cloud may have higher per-token costs but lower operational complexity. Teams should calculate total cost including operational overhead, not just inference pricing, to determine which approach delivers better economics for their specific situation.
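That total-cost comparison can be made concrete with a small model: a specialized provider with cheaper tokens but more operational overhead versus a managed service with the opposite profile. Every number below is invented for the sketch:

```python
# Illustrative total-cost comparison for the tradeoff described above.
# All prices and volumes are made up; only the structure is the point.

def monthly_total_cost(tokens_per_month: float,
                       price_per_million_tokens: float,
                       ops_overhead_per_month: float) -> float:
    inference = tokens_per_month / 1e6 * price_per_million_tokens
    return inference + ops_overhead_per_month

tokens = 200e9  # a high-volume workload: 200B tokens/month
specialized = monthly_total_cost(tokens, 0.05, 15_000)  # cheap tokens, more ops
managed = monthly_total_cost(tokens, 0.20, 2_000)       # pricier tokens, less ops

print(f"specialized: ${specialized:,.0f}/month")
print(f"managed:     ${managed:,.0f}/month")
# Rerun with a small tokens value (e.g., 2e9) and the ranking flips:
# the operational overhead dominates before volume amortizes it.
```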
[4]
NVIDIA Blackwell Ultra delivers 50x higher efficiency for agentic AI
Nvidia's new benchmark data reveals that GB300 NVL72 systems equipped with Blackwell Ultra GPUs achieve up to 50x higher throughput per megawatt and 35x lower cost per token compared to the Hopper platform for low-latency AI workloads. The metrics reflect combined hardware and software advancements targeting agentic AI and coding assistant deployments. Performance gains derive from specific architectural changes and library optimizations that address transformer attention layer bottlenecks. These efficiency improvements reduce operational costs for cloud providers and inference services, enabling broader deployment of compute-intensive models.

Blackwell Ultra Tensor Cores provide 1.5x greater compute performance than standard Blackwell GPUs. The architecture doubles attention-layer processing via accelerated softmax execution, directly supporting reasoning models that utilize large context windows. Nvidia's TensorRT-LLM inference library has recorded sustained performance increases, with SemiAnalysis benchmarks documenting that throughput per GPU doubled at certain interactivity levels since October 2025. The company states that these developments deliver a 10x increase in tokens per second per user and a 5x improvement in tokens per second per megawatt relative to Hopper. Cumulatively, these factors produce the 50x rise in AI factory output.

Chen Goldberg, senior vice president of engineering at CoreWeave, emphasized the operational focus of these advancements. "As inference moves to the center of AI production, long-context performance and token efficiency become critical," Goldberg stated. "Grace Blackwell NVL72 addresses that challenge directly."

CoreWeave announced in 2025 that it was the first AI cloud provider to deploy GB300 NVL72 systems in production, integrating the hardware with its Kubernetes-based cloud stack. Microsoft subsequently deployed what it describes as the world's first large-scale GB300 NVL72 supercomputing cluster.
Testing validated by Signal65 recorded the cluster achieving over 1.1 million tokens per second on a single rack. Oracle's OCI platform is deploying GB300 NVL72 systems, with plans to scale Superclusters beyond 100,000 Blackwell GPUs to support inference workload demand.

Leading inference providers, including Baseten, DeepInfra, Fireworks AI and Together AI, reported up to 10x cost reductions using the standard Blackwell platform. The Blackwell Ultra platform extends these efficiencies to workloads requiring low latency, achieving a 35x lower cost per million tokens. This reduction facilitates the economically viable deployment of AI agents and coding assistants at scale.

Nvidia has previewed its next-generation Rubin platform, projecting a 10x performance improvement over Blackwell.
[5]
NVIDIA's Blackwell Ultra Pushes "Agentic AI" Performance to New Heights, Delivering Up to 50× Higher Tokens/Watt & Stronger Long-Context Workloads
NVIDIA's Blackwell Ultra is the modern-day computing option for hyperscalers, and in newer benchmarks, the GB300 NVL72 shows immense performance in low-latency and long-context workloads. The AI industry has evolved across multiple layers since its original boom back in 2022, and right now, we are seeing a major shift towards agentic computing, driven by applications/wrappers built on frontier models. At the same time, for infrastructure providers like NVIDIA, it has become increasingly important to have ample memory bandwidth and performance onboard to meet the latency requirements of agentic frameworks, and with Blackwell Ultra, Team Green has done just that. In a new blog post, NVIDIA tested Blackwell Ultra on SemiAnalysis's InferenceMAX, and the results are astonishing.

NVIDIA's first infographic emphasizes a figure called "tokens/watt," which is probably one of the world's most important numbers to look at with the current hyperscaler buildout. The company has focused on both raw performance and throughput optimizations, which is why, with GB300 NVL72, NVIDIA sees a 50x increase in throughput per megawatt compared to Hopper GPUs. The comparison below shows the best possible 'deployed state' for each architecture.

If you are curious about how the throughput-per-megawatt gains are so phenomenal, well, NVIDIA takes pride in its NVLink technology. Blackwell Ultra has expanded to a 72-GPU front, joining them into a single, unified NVLink fabric with 130 TB/s of connectivity. Compared to Hopper, which is confined to an 8-chip NVLink design, NVIDIA has brought in superior architecture, rack design and, more importantly, the NVFP4 precision format, which is why GB300 dominates in throughput.

Given the "agentic AI" wave, NVIDIA's GB300 NVL72 testing also focuses on token costs and on the upgrades mentioned above. Team Green sees a massive 35x reduction in cost per million tokens, making it the go-to inference option for frontier labs and hyperscalers.
Yet again, scaling laws remain intact and are evolving at a pace no one would've imagined, and the major catalysts for these performance upgrades are indeed the "extreme co-design" structure NVIDIA has in place, along with, of course, what we call Huang's Law. The comparison with Hopper becomes a bit unfair when you factor in the generational differences in compute nodes and architectures, which is why NVIDIA has also compared the GB200 with the GB300 (NVL72s) across long-context workloads. Context is indeed the next major constraint for agents, given that to maintain state over an entire codebase, token usage rises aggressively. With Blackwell Ultra, NVIDIA sees up to 1.5x lower cost per token and 2x faster attention processing, making it well-positioned for agentic workloads.

Given that Blackwell Ultra is currently in the process of hyperscaler integrations, these are among the first benchmarks of the architecture, and by the looks of it, NVIDIA has managed to keep performance scaling intact and aligned with modern-day AI use cases. And, with Vera Rubin, one could expect even better performance than the Blackwell generation, making it one of the many reasons why NVIDIA currently dominates the infrastructure race.
[6]
NVIDIA Has Managed to Reduce Token Costs by a Whopping 10x With Its Newest Blackwell Platform, Credited to Team Green's "Extreme Codesign" Approach
NVIDIA's Blackwell platform has brought new levels of token optimization to AI inference workloads, as the company reveals a massive milestone in the realm of tokenomics.

While NVIDIA has been racing to build new infrastructure in the AI world, one of the company's biggest focuses has been improving the efficiency of the hardware it deploys. With Blackwell-trained frontier AI models arriving across the industry, we have seen how NVIDIA has progressed with token output and costs, and now, in a new blog post, the company has revealed that it has been working with businesses to scale up Blackwell performance, reporting a significant ten-fold improvement over the Hopper generation.

That's why leading inference providers including Baseten, DeepInfra, Fireworks AI and Together AI are using the NVIDIA Blackwell platform, which helps them reduce cost per token by up to 10x compared with the NVIDIA Hopper platform. These providers host advanced open source models, which have now reached frontier-level intelligence. By combining open source frontier intelligence, the extreme hardware-software codesign of NVIDIA Blackwell and their own optimized inference stacks, these providers are enabling dramatic token cost reductions for businesses across every industry. - NVIDIA

While discussing tokenomics on Blackwell, NVIDIA highlighted organizations like Baseten and Sully.ai, along with the gaming-focused DeepInfra and Latitude. For each company, the Blackwell architecture has enabled lower latency, optimal inference costs and reliable responses, which is why the tech stack is the go-to option for mainstream AI companies today. Even in multi-agent workflows and deploying specialized AI agents, a company called Sentient Labs has achieved "25-50% better cost efficiency" relative to Hopper. NVIDIA's progress with the Blackwell AI architecture is driven by the company's "extreme co-design" approach, which is optimal for today's MoE architectures.
With GB200 NVL72, NVIDIA uses a 72-GPU configuration coupled with 30TB of fast shared memory to take expert parallelism to a new level, ensuring that token batches are constantly split and scattered across GPUs even as communication volume grows at a non-linear rate. This is one of the reasons tokenomics is shaping up to be among Blackwell's most efficient metrics yet. With Vera Rubin, Team Green plans to push infrastructure efficiency further still, driven by architectural advancements, specialized mechanisms such as CPX for prefill, and more. The world of AI is evolving at an overwhelming pace, which is why optimizing existing hardware is as important as developing new hardware.
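The expert-parallel scattering described above can be sketched at toy scale. In the sketch below, random routing stands in for a learned MoE gating network and the GPU count is shrunk from 72, so it is purely illustrative:

```python
import random

# Toy sketch of expert-parallel routing: each token is assigned an expert,
# and each expert lives on one GPU, so a batch is split and scattered
# across devices rather than processed whole on a single device.
NUM_EXPERTS = 8
NUM_GPUS = 4  # shrunk from 72 for illustration

def route(token_ids):
    """Group token ids by the GPU that hosts their assigned expert."""
    per_gpu = {g: [] for g in range(NUM_GPUS)}
    for tok in token_ids:
        expert = random.randrange(NUM_EXPERTS)  # stand-in for the router
        per_gpu[expert % NUM_GPUS].append(tok)  # expert -> hosting GPU
    return per_gpu

shards = route(list(range(32)))
assert sum(len(v) for v in shards.values()) == 32  # no token lost in the scatter
```

The scatter is exactly why communication volume grows with expert parallelism: every token may need to travel to a different GPU and back on each MoE layer, which is what the NVL72 shared-memory fabric is built to absorb.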
NVIDIA's GB300 NVL72 systems powered by Blackwell Ultra GPUs achieve up to 50x higher throughput per megawatt and 35x lower cost per token compared with the Hopper platform. Cloud providers including Microsoft, CoreWeave, and Oracle are deploying these systems at scale for agentic AI and coding assistants, while leading inference providers report 4x to 10x cost reductions using open-source models.

NVIDIA has released new performance data showing that its GB300 NVL72 systems, equipped with Blackwell Ultra GPUs, achieve up to 50x higher throughput per megawatt and 35x lower cost per token than the NVIDIA Hopper platform for low-latency workloads [1]. These efficiency gains target agentic AI applications and AI coding assistants, which drove explosive growth in software-programming-related AI queries, from 11% to approximately 50% last year, according to OpenRouter's State of Inference report [1].
Source: Wccftech
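The throughput-per-megawatt and cost-per-token claims are linked by simple unit arithmetic. The sketch below uses invented numbers (the baseline throughput and the electricity price are assumptions, not NVIDIA-published figures) and models only the energy component of cost, which is why it recovers the 50x efficiency ratio rather than the all-in 35x cost figure:

```python
# Hypothetical illustration: how tokens-per-second-per-megawatt converts
# into an energy cost per million tokens. The baseline throughput and the
# electricity price are assumed values, not NVIDIA-published figures.

def cost_per_million_tokens(tokens_per_sec_per_mw: float,
                            usd_per_mwh: float) -> float:
    """Energy cost (USD) to generate one million tokens."""
    tokens_per_mwh = tokens_per_sec_per_mw * 3600  # tokens per MW-hour
    return usd_per_mwh / tokens_per_mwh * 1_000_000

hopper = cost_per_million_tokens(10_000, usd_per_mwh=100.0)      # assumed baseline
gb300 = cost_per_million_tokens(10_000 * 50, usd_per_mwh=100.0)  # the 50x claim

print(f"energy cost ratio: {hopper / gb300:.0f}x")  # prints "energy cost ratio: 50x"
```

Since energy is only one component of total cost per token (hardware amortization, networking, and facilities scale differently), a 50x efficiency gain translating into a smaller 35x all-in cost reduction is consistent.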
The performance improvements stem from extreme hardware-software codesign that addresses transformer attention-layer bottlenecks. Blackwell Ultra Tensor Cores provide 1.5x greater compute performance than standard NVIDIA Blackwell GPUs, while the architecture doubles attention-layer processing through accelerated softmax execution [4]. Cloud providers including Microsoft, CoreWeave, and Oracle Cloud Infrastructure are deploying GB300 NVL72 systems in production for low-latency and long-context workloads such as agentic coding [1].
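The softmax that Blackwell Ultra's accelerated execution targets is the normalization step inside every attention layer. The minimal, numerically stable reference version below is purely illustrative and bears no relation to the fused GPU kernel:

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)                            # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-9  # a softmax row always sums to 1
```

This exp-and-normalize pass runs once per query row in every attention layer and grows with context length, which is why the article singles out its acceleration for long-context workloads.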
Continuous software optimizations from NVIDIA's TensorRT-LLM, NVIDIA Dynamo, Mooncake, and SGLang teams have significantly boosted Blackwell NVL72 throughput for Mixture-of-Experts (MoE) inference across all latency targets [1]. TensorRT-LLM library improvements alone have delivered up to 5x better performance on GB200 for low-latency workloads compared with just four months ago [1].

Key software optimizations include higher-performance GPU kernels tuned for efficiency and low latency, NVLink Symmetric Memory enabling direct GPU-to-GPU memory access, and programmatic dependent launch, which minimizes idle time by launching the next kernel's setup phase before the previous one completes [1]. SemiAnalysis benchmarks documented that throughput per GPU has doubled at certain interactivity levels since October 2025, with NVIDIA stating these developments deliver a 10x increase in tokens per second per user and a 5x improvement in tokens per second per megawatt relative to Hopper [4].
Leading inference providers including Baseten, DeepInfra, Fireworks AI, and Together AI are reducing AI inference costs by up to 10x using open-source models on the NVIDIA Blackwell platform [2]. Production deployment data shows significant cost improvements across healthcare, gaming, agentic chat, and customer service as enterprises scale AI from pilot projects to millions of users [3].
Source: VentureBeat
The 4x to 10x cost reductions required combining Blackwell hardware with optimized software stacks and switching from proprietary to open-source models that now match frontier-level intelligence [3]. Hardware improvements alone delivered 2x gains in some deployments, but reaching larger cost reductions required adopting low-precision formats such as NVFP4 and moving away from closed-source APIs that charge premium rates [3].

Sully.ai cut healthcare AI inference costs by 90%, a 10x reduction, while improving response times by 65% for critical workflows such as generating medical notes, by switching from proprietary models to open-source models running on Baseten's Blackwell-powered platform [2]. The company has returned over 30 million minutes to physicians, time previously lost to data entry and manual tasks [2].

Latitude reduced gaming inference costs 4x for its AI Dungeon platform by running large MoE models on DeepInfra's Blackwell deployment [2]. Cost per million tokens dropped from 20 cents on the NVIDIA Hopper platform to 10 cents on Blackwell, then to 5 cents after adopting Blackwell's native NVFP4 low-precision format [2]. Sentient Foundation achieved 25% to 50% better cost efficiency using Fireworks AI's Blackwell-optimized inference stack, processing 5.6 million queries in a single week during its viral launch [3].
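Latitude's price points decompose into a hardware step and a precision-format step; this trivial sketch just replays the stated numbers:

```python
# Latitude's reported cost-per-million-token progression for AI Dungeon.
hopper_usd = 0.20           # NVIDIA Hopper platform
blackwell_usd = 0.10        # Blackwell hardware alone
blackwell_nvfp4_usd = 0.05  # Blackwell with native NVFP4

print(hopper_usd / blackwell_usd)        # prints 2.0 (hardware step)
print(hopper_usd / blackwell_nvfp4_usd)  # prints 4.0 (hardware + NVFP4)
```

The 2x hardware-only step matches the earlier observation that hardware alone delivered roughly 2x gains in some deployments, with the NVFP4 low-precision format supplying the remaining factor.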
For long-context workloads with 128,000-token inputs and 8,000-token outputs, such as AI coding assistants reasoning across entire codebases, GB300 NVL72 delivers up to 1.5x lower cost per token than GB200 NVL72 [1]. Blackwell Ultra's 1.5x higher NVFP4 compute performance and 2x faster attention processing enable agents to efficiently understand entire codebases [1].

Chen Goldberg, senior vice president of engineering at CoreWeave, stated: "As inference moves to the center of AI production, long-context performance and token efficiency become critical. Grace Blackwell NVL72 addresses that challenge directly" [1]. CoreWeave was the first AI cloud provider to deploy GB300 NVL72 systems in production [4]. Microsoft subsequently deployed what it describes as the world's first large-scale GB300 NVL72 supercomputing cluster, with testing validated by Signal65 recording over 1.1 million tokens per second on a single rack [4].
Source: NVIDIA
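NVFP4 packs each value into 4 bits and shares a scale factor across a small block of values. The sketch below is a generic block-scaled quantizer using small signed integers rather than the actual FP4 (E2M1) encoding, so it only illustrates the general idea that a per-block scale keeps reconstruction error bounded:

```python
# Generic block-scaled quantization sketch (illustrative only; real NVFP4
# stores FP4 elements with a shared per-block scale, not integers).

def quantize_block(values, levels=7):
    """Map a block of floats to signed ints in [-levels, levels] plus one scale."""
    scale = max(abs(v) for v in values) / levels or 1.0  # 1.0 for an all-zero block
    return [round(v / scale) for v in values], scale

def dequantize_block(quants, scale):
    return [q * scale for q in quants]

block = [0.12, -0.5, 0.33, 0.9]
quants, scale = quantize_block(block)
approx = dequantize_block(quants, scale)
# each reconstructed value is within half a quantization step of the original
assert all(abs(a - b) <= scale / 2 + 1e-12 for a, b in zip(block, approx))
```

Shrinking the block over which the scale is shared is the key trade-off: smaller blocks track local magnitudes more tightly at the cost of storing more scale factors.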
Blackwell Ultra expands to a 72-GPU configuration, joining all GPUs into a single unified NVLink fabric with 130 TB/s of connectivity [5]. Compared with Hopper, which is confined to an 8-GPU NVLink domain, NVIDIA has delivered superior architecture, rack design, and the NVFP4 precision format, which explains why GB300 dominates in throughput [5].
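The fabric figure implies a per-GPU share that is easy to check, assuming the quoted 130 TB/s is aggregate bandwidth divided evenly across the rack:

```python
# Back-of-envelope per-GPU share of the NVL72 fabric, assuming the quoted
# 130 TB/s is aggregate NVLink bandwidth split evenly across all 72 GPUs.
aggregate_tb_per_s = 130
gpus = 72
per_gpu = aggregate_tb_per_s / gpus
print(f"{per_gpu:.2f} TB/s per GPU")  # prints "1.81 TB/s per GPU"
```

That per-GPU share lines up with the roughly 1.8 TB/s of NVLink bandwidth publicly quoted for Blackwell-generation GPUs, which supports reading the 130 TB/s number as an aggregate figure.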
"Performance is what drives down the cost of inference," said Dion Harris, senior director of HPC and AI hyperscaler solutions at NVIDIA. "What we're seeing in inference is that throughput literally translates into real dollar value and driving down the cost" [3]. Oracle's OCI platform is deploying GB300 NVL72 systems with plans to scale Superclusters beyond 100,000 Blackwell GPUs to support inference workload demand [4]. NVIDIA has previewed its next-generation Rubin platform, projecting a 10x performance improvement over Blackwell [4].