3 Sources
[1]
Leading Inference Providers Cut AI Costs by up to 10x With Open Source Models on NVIDIA Blackwell
Baseten, DeepInfra, Fireworks AI and Together AI are reducing cost per token across industries with optimized inference stacks running on the NVIDIA Blackwell platform.

A diagnostic insight in healthcare. A character's dialogue in an interactive game. An autonomous resolution from a customer service agent. Each of these AI-powered interactions is built on the same unit of intelligence: a token. Scaling these AI interactions requires businesses to consider whether they can afford more tokens. The answer lies in better tokenomics -- which at its core is about driving down the cost of each token.

This downward trend is unfolding across industries. Recent MIT research found that infrastructure and algorithmic efficiencies are reducing inference costs for frontier-level performance by up to 10x annually.

To understand how infrastructure efficiency improves tokenomics, consider the analogy of a high-speed printing press. If the press produces 10x the output with an incremental investment in ink, energy and the machine itself, the cost to print each individual page drops. In the same way, investments in AI infrastructure can lead to far greater token output compared with the increase in cost -- causing a meaningful reduction in the cost per token.

That's why leading inference providers including Baseten, DeepInfra, Fireworks AI and Together AI are using the NVIDIA Blackwell platform, which helps them reduce cost per token by up to 10x compared with the NVIDIA Hopper platform. These providers host advanced open source models, which have now reached frontier-level intelligence. By combining open source frontier intelligence, the extreme hardware-software codesign of NVIDIA Blackwell and their own optimized inference stacks, these providers are enabling dramatic token cost reductions for businesses across every industry.

Healthcare -- Baseten and Sully.ai Cut AI Inference Costs by 10x

In healthcare, tedious, time-consuming tasks like medical coding, documentation and managing insurance forms cut into the time doctors can spend with patients. Sully.ai helps solve this problem by developing "AI employees" that can handle routine tasks like medical coding and note-taking. As the company's platform scaled, its proprietary, closed source models created three bottlenecks: unpredictable latency in real-time clinical workflows, inference costs that scaled faster than revenue and insufficient control over model quality and updates.

To overcome these bottlenecks, Sully.ai uses Baseten's Model API, which deploys open source models such as gpt-oss-120b on NVIDIA Blackwell GPUs. Baseten used the low-precision NVFP4 data format, the NVIDIA TensorRT-LLM library and the NVIDIA Dynamo inference framework to deliver optimized inference. The company chose NVIDIA Blackwell to run its Model API after seeing up to 2.5x better throughput per dollar compared with the NVIDIA Hopper platform.

As a result, Sully.ai's inference costs dropped by 90%, representing a 10x reduction compared with the prior closed source implementation, while response times improved by 65% for critical workflows like generating medical notes. The company has now returned over 30 million minutes to physicians, time previously lost to data entry and other manual tasks.
Gaming -- DeepInfra and Latitude Reduce Cost per Token by 4x

Latitude is building the future of AI-native gaming with its AI Dungeon adventure-story game and upcoming AI-powered role-playing gaming platform, Voyage, where players can create or play worlds with the freedom to choose any action and make their own story.

The company's platform uses large language models to respond to players' actions -- but this comes with scaling challenges, as every player action triggers an inference request. Costs scale with engagement, and response times must stay fast enough to keep the experience seamless.

Latitude runs large open source models on DeepInfra's inference platform, powered by NVIDIA Blackwell GPUs and TensorRT-LLM. For a large-scale mixture-of-experts (MoE) model, DeepInfra reduced the cost per million tokens from 20 cents on the NVIDIA Hopper platform to 10 cents on Blackwell. Moving to Blackwell's native low-precision NVFP4 format further cut that cost to just 5 cents -- for a total 4x improvement in cost per token -- while maintaining the accuracy that customers expect.

Running these large-scale MoE models on DeepInfra's Blackwell-powered platform allows Latitude to deliver fast, reliable responses cost-effectively. DeepInfra's inference platform delivers this performance while reliably handling traffic spikes, letting Latitude deploy more capable models without compromising player experience.

Agentic Chat -- Fireworks AI and Sentient Foundation Lower AI Costs by up to 50%

Sentient Labs is focused on bringing AI developers together to build powerful reasoning AI systems that are all open source. The goal is to accelerate AI toward solving harder reasoning problems through research in secure autonomy, agentic architecture and continual learning. Its first app, Sentient Chat, orchestrates complex multi-agent workflows and integrates more than a dozen specialized AI agents from the community. As a result, Sentient Chat has massive compute demands: a single user query can trigger a cascade of autonomous interactions that typically leads to costly infrastructure overhead.

To manage this scale and complexity, Sentient uses Fireworks AI's inference platform running on NVIDIA Blackwell. With Fireworks' Blackwell-optimized inference stack, Sentient achieved 25-50% better cost efficiency compared with its previous Hopper-based deployment. This higher throughput per GPU allowed the company to serve significantly more concurrent users for the same cost. The platform's scalability supported a viral launch of 1.8 million waitlisted users in 24 hours and processed 5.6 million queries in a single week while delivering consistent low latency.

Customer Service -- Together AI and Decagon Drive Down Cost by 6x

Customer service calls with voice AI often end in frustration because even a slight delay can lead users to talk over the agent, hang up or lose trust. Decagon builds AI agents for enterprise customer support, with AI-powered voice being its most demanding channel. Decagon needed infrastructure that could deliver sub-second responses under unpredictable traffic loads, with tokenomics that supported 24/7 voice deployments. Together AI runs production inference for Decagon's multimodel voice stack on NVIDIA Blackwell GPUs.
The companies collaborated on several key optimizations: speculative decoding that trains smaller models to generate faster responses while a larger model verifies accuracy in the background, caching repeated conversation elements to speed up responses and building automatic scaling that handles traffic surges without degrading performance.

Decagon saw response times under 400 milliseconds even when processing thousands of tokens per query. Cost per query, which is the total cost to complete one voice interaction, dropped by 6x compared with using closed source proprietary models. This was achieved through the combination of Decagon's multimodel approach (some open source, some trained in house on NVIDIA GPUs), NVIDIA Blackwell's extreme codesign and Together's optimized inference stack.

Optimizing Tokenomics With Extreme Codesign

The dramatic cost savings seen across healthcare, gaming and customer service are driven by the efficiency of NVIDIA Blackwell. The NVIDIA GB200 NVL72 system further scales this impact by delivering a breakthrough 10x reduction in cost per token for reasoning MoE models compared with NVIDIA Hopper.

NVIDIA's extreme codesign across every layer of the stack -- spanning compute, networking and software -- and its partner ecosystem are unlocking massive reductions in cost per token at scale. This momentum continues with the NVIDIA Rubin platform -- integrating six new chips into a single AI supercomputer to deliver 10x performance and 10x lower token cost over Blackwell.

Explore NVIDIA's full-stack inference platform to learn more about how it delivers better tokenomics for AI inference.
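As a rough illustration of the speculative decoding technique mentioned above, the toy greedy sketch below has a cheap "draft" model propose a few tokens and an expensive "target" model verify them, keeping the longest prefix it agrees with. Both models here are simple stand-in functions, not Decagon's or Together AI's actual models.

```python
def draft_next(tokens):
    # Cheap draft model: guesses the next token with a simple rule.
    return (tokens[-1] + 1) % 50

def target_next(tokens):
    # Expensive target model: the answer the output must match exactly.
    return (tokens[-1] + 1) % 50 if tokens[-1] % 7 else 0

def speculative_decode(prompt, n_new, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Target model verifies the proposals; in a real system this is one
        #    batched forward pass instead of k sequential ones.
        accepted = []
        for proposed in draft:
            truth = target_next(tokens + accepted)
            accepted.append(truth)
            if truth != proposed:      # first mismatch: keep the correction, stop
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new]

print(speculative_decode([3], 10))     # same output as running the target model alone
```

Because the target model always supplies the token at the first mismatch, the output is identical to running the target model greedily on its own; the draft model only changes how many target passes are needed.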
[2]
AI inference costs dropped up to 10x on Nvidia's Blackwell -- but hardware is only half the equation
Lowering the cost of inference is typically a combination of hardware and software. A new analysis released Thursday by Nvidia details how four leading inference providers are reporting 4x to 10x reductions in cost per token. The dramatic cost reductions were achieved using Nvidia's Blackwell platform with open-source models. Production deployment data from Baseten, DeepInfra, Fireworks AI and Together AI shows significant cost improvements across healthcare, gaming, agentic chat, and customer service as enterprises scale AI from pilot projects to millions of users.

The 4x to 10x cost reductions reported by inference providers required combining Blackwell hardware with two other elements: optimized software stacks and switching from proprietary to open-source models that now match frontier-level intelligence. Hardware improvements alone delivered 2x gains in some deployments, according to the analysis. Reaching larger cost reductions required adopting low-precision formats like NVFP4 and moving away from closed source APIs that charge premium rates.

The economics prove counterintuitive. Reducing inference costs requires investing in higher-performance infrastructure because throughput improvements translate directly into lower per-token costs. "Performance is what drives down the cost of inference," Dion Harris, senior director of HPC and AI hyperscaler solutions at Nvidia, told VentureBeat in an exclusive interview. "What we're seeing in inference is that throughput literally translates into real dollar value and driving down the cost."

Production deployments show 4x to 10x cost reductions

Nvidia detailed four customer deployments in a blog post showing how the combination of Blackwell infrastructure, optimized software stacks and open-source models delivers cost reductions across different industry workloads. The case studies span high-volume applications where inference economics directly determines business viability.

Sully.ai cut healthcare AI inference costs by 90% (a 10x reduction) while improving response times 65% by switching from proprietary models to open-source models running on Baseten's Blackwell-powered platform, according to Nvidia. The company returned over 30 million minutes to physicians by automating medical coding and note-taking tasks that previously required manual data entry.

Nvidia also reported that Latitude reduced gaming inference costs 4x for its AI Dungeon platform by running large mixture-of-experts (MoE) models on DeepInfra's Blackwell deployment. Cost per million tokens dropped from 20 cents on Nvidia's previous Hopper platform to 10 cents on Blackwell, then to 5 cents after adopting Blackwell's native NVFP4 low-precision format. Hardware alone delivered 2x improvement, but reaching 4x required the precision format change.

Sentient Foundation achieved 25% to 50% better cost efficiency for its agentic chat platform using Fireworks AI's Blackwell-optimized inference stack, according to Nvidia. The platform orchestrates complex multi-agent workflows and processed 5.6 million queries in a single week during its viral launch while maintaining low latency.

Nvidia said Decagon saw 6x cost reduction per query for AI-powered voice customer support by running its multimodel stack on Together AI's Blackwell infrastructure. Response times stayed under 400 milliseconds, even when processing thousands of tokens per query, critical for voice interactions where delays cause users to hang up or lose trust.
Technical factors driving 4x versus 10x improvements

The range from 4x to 10x cost reductions across deployments reflects different combinations of technical optimizations rather than just hardware differences. Three factors emerge as primary drivers: precision format adoption, model architecture choices, and software stack integration.

Precision formats show the clearest impact. Latitude's case demonstrates this directly. Moving from Hopper to Blackwell delivered 2x cost reduction through hardware improvements. Adopting NVFP4, Blackwell's native low-precision format, doubled that improvement to 4x total. NVFP4 reduces the number of bits required to represent model weights and activations, allowing more computation per GPU cycle while maintaining accuracy. The format works particularly well for MoE models where only a subset of the model activates for each inference request.

Model architecture matters. MoE models, which activate different specialized sub-models based on input, benefit from Blackwell's NVLink fabric that enables rapid communication between experts. "Having those experts communicate across that NVLink fabric allows you to reason very quickly," Harris said. Dense models that activate all parameters for every inference don't leverage this architecture as effectively.

Software stack integration creates additional performance deltas. Harris said that Nvidia's co-design approach -- where Blackwell hardware, NVL72 scale-up architecture, and software like Dynamo and TensorRT-LLM are optimized together -- also makes a difference. Baseten's deployment for Sully.ai used this integrated stack, combining NVFP4, TensorRT-LLM and Dynamo to achieve the 10x cost reduction. Providers running alternative frameworks like vLLM may see lower gains.

Workload characteristics matter. Reasoning models show particular advantages on Blackwell because they generate significantly more tokens to reach better answers. The platform's ability to process these extended token sequences efficiently through disaggregated serving, where context prefill and token generation are handled separately, makes reasoning workloads cost-effective.

Teams evaluating potential cost reductions should examine their workload profiles against these factors. High token generation workloads using mixture-of-experts models with the integrated Blackwell software stack will approach the 10x range. Lower token volumes using dense models on alternative frameworks will land closer to 4x.

What teams should test before migrating

While these case studies focus on Nvidia Blackwell deployments, enterprises have multiple paths to reducing inference costs. AMD's MI300 series, Google TPUs, and specialized inference accelerators from Groq and Cerebras offer alternative architectures. Cloud providers also continue optimizing their inference services. The question isn't whether Blackwell is the only option but whether the specific combination of hardware, software and models fits particular workload requirements.

Enterprises considering Blackwell-based inference should start by calculating whether their workloads justify infrastructure changes. "Enterprises need to work back from their workloads and use case and cost constraints," Shruti Koparkar, AI product marketing at Nvidia, told VentureBeat. The deployments achieving 6x to 10x improvements all involved high-volume, latency-sensitive applications processing millions of requests monthly.
Teams running lower volumes or applications with latency budgets exceeding one second should explore software optimization or model switching before considering infrastructure upgrades.

Testing matters more than provider specifications. Koparkar emphasizes that providers publish throughput and latency metrics, but these represent ideal conditions. "If it's a highly latency-sensitive workload, they might want to test a couple of providers and see who meets the minimum they need while keeping the cost down," she said. Teams should run actual production workloads across multiple Blackwell providers to measure real performance under their specific usage patterns and traffic spikes rather than relying on published benchmarks.

The staged approach Latitude used provides a model for evaluation. The company first moved to Blackwell hardware and measured 2x improvement, then adopted NVFP4 format to reach 4x total reduction. Teams currently on Hopper or other infrastructure can test whether precision format changes and software optimization on existing hardware capture meaningful savings before committing to full infrastructure migrations. Running open source models on current infrastructure might deliver half the potential cost reduction without new hardware investments.

Provider selection requires understanding software stack differences. While multiple providers offer Blackwell infrastructure, their software implementations vary. Some run Nvidia's integrated stack using Dynamo and TensorRT-LLM, while others use frameworks like vLLM. Harris acknowledges performance deltas exist between these configurations. Teams should evaluate what each provider actually runs and how it matches their workload requirements rather than assuming all Blackwell deployments perform identically.

The economic equation extends beyond cost per token. Specialized inference providers like Baseten, DeepInfra, Fireworks and Together offer optimized deployments but require managing additional vendor relationships. Managed services from AWS, Azure or Google Cloud may have higher per-token costs but lower operational complexity. Teams should calculate total cost including operational overhead, not just inference pricing, to determine which approach delivers better economics for their specific situation.
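A minimal sketch of the kind of head-to-head test described above: replay a sample of real production prompts against each candidate provider and compare latency and estimated cost. The endpoint URLs, model name and prices below are placeholders, and the snippet assumes OpenAI-compatible chat completion endpoints, which many inference providers expose but which should be confirmed for each provider.

```python
import statistics
import time

import requests

# Placeholder provider configs: URLs, model names and prices are hypothetical.
PROVIDERS = {
    "provider_a": {"url": "https://provider-a.example/v1/chat/completions",
                   "model": "open-model-120b", "usd_per_m_tokens": 0.10},
    "provider_b": {"url": "https://provider-b.example/v1/chat/completions",
                   "model": "open-model-120b", "usd_per_m_tokens": 0.05},
}

def run_trial(cfg, prompt, api_key):
    """Send one request; return (end-to-end latency in seconds, completion tokens)."""
    body = {"model": cfg["model"],
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256}
    start = time.perf_counter()
    resp = requests.post(cfg["url"], json=body, timeout=60,
                         headers={"Authorization": f"Bearer {api_key}"})
    latency = time.perf_counter() - start
    tokens = resp.json().get("usage", {}).get("completion_tokens", 0)
    return latency, tokens

def benchmark(prompts, api_keys):
    for name, cfg in PROVIDERS.items():
        latencies, total_tokens = [], 0
        for prompt in prompts:
            latency, tokens = run_trial(cfg, prompt, api_keys[name])
            latencies.append(latency)
            total_tokens += tokens
        p95 = sorted(latencies)[int(len(latencies) * 0.95)]
        cost = total_tokens / 1e6 * cfg["usd_per_m_tokens"]
        print(f"{name}: median {statistics.median(latencies):.2f}s, "
              f"p95 {p95:.2f}s, est. cost ${cost:.4f}")

# Example: benchmark(production_prompts, {"provider_a": "...", "provider_b": "..."})
```

The point is to measure tail latency and spend under the team's own prompts and traffic patterns rather than trusting published throughput numbers.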
[3]
NVIDIA Has Managed to Reduce Token Costs by a Whopping 10x With Its Newest Blackwell Platform, Credited to Team Green's "Extreme Codesign" Approach
NVIDIA's Blackwell platform has brought new levels of token optimization to AI inference workloads, as the company reveals a major milestone in the realm of tokenomics.

While NVIDIA has been racing to build new AI infrastructure, one of the company's biggest focuses has been improving the efficiency of the hardware it deploys. With frontier AI models trained on Blackwell now arriving across the industry, we have seen how NVIDIA has progressed on token output and costs, and in a new blog post the company reveals that it has been working with businesses to scale up Blackwell performance, reporting a significant ten-fold improvement over the Hopper generation.

That's why leading inference providers including Baseten, DeepInfra, Fireworks AI and Together AI are using the NVIDIA Blackwell platform, which helps them reduce cost per token by up to 10x compared with the NVIDIA Hopper platform. These providers host advanced open source models, which have now reached frontier-level intelligence. By combining open source frontier intelligence, the extreme hardware-software codesign of NVIDIA Blackwell and their own optimized inference stacks, these providers are enabling dramatic token cost reductions for businesses across every industry. - NVIDIA

While discussing tokenomics on Blackwell, NVIDIA highlighted organizations like Baseten and Sully.ai, along with the gaming-focused DeepInfra and Latitude. For each company, the Blackwell architecture has enabled lower latency, lower inference costs and reliable responses, which is why the stack is the go-to option for mainstream AI companies today. Even in multi-agent workflows that deploy specialized AI agents, a company called Sentient Labs has achieved "25-50% better cost efficiency" relative to Hopper.

NVIDIA's progress with the Blackwell architecture is driven by the company's "extreme co-design" approach, which is well suited to today's MoE architectures. With GB200 NVL72, NVIDIA uses a 72-chip configuration coupled with 30TB of fast shared memory to take expert parallelism to a new level, keeping token batches constantly split and scattered across GPUs even as communication volume grows at a non-linear rate. This is one of the reasons Blackwell delivers the company's most efficient tokenomics yet. With Vera Rubin, Team Green plans to push infrastructure efficiency further still, driven by architecture advancements, specialized mechanisms like CPX for prefill and much more. The world of AI is evolving at an overwhelming pace, which is why optimizing existing hardware matters as much as developing new hardware.
Inference providers including Baseten, DeepInfra, Fireworks AI and Together AI report 4x to 10x reductions in AI inference costs using NVIDIA Blackwell with open source models. The dramatic improvements stem from combining Blackwell's hardware capabilities with optimized software stacks and low-precision formats like NVFP4, transforming economics across healthcare, gaming and customer service applications.
AI inference costs have dropped by up to 10x as leading inference providers deploy the NVIDIA Blackwell platform with open source models, according to production data released by NVIDIA. Baseten, DeepInfra, Fireworks AI, and Together AI report cost reductions ranging from 4x to 10x compared with the previous NVIDIA Hopper platform, transforming the economics of scaling AI applications across industries [1]. The improvements address a critical business challenge: whether companies can afford to scale AI interactions as demand grows. Tokenomics -- the cost per token -- has become the determining factor in whether AI deployments remain viable at scale.

The cost reductions emerge from combining three elements rather than hardware alone. NVIDIA Blackwell delivers baseline performance improvements, but reaching 4x to 10x reductions requires pairing the platform with optimized software stacks and switching to open source models that now match frontier-level intelligence [2]. "Performance is what drives down the cost of inference," Dion Harris, senior director of HPC and AI hyperscaler solutions at NVIDIA, told VentureBeat. "What we're seeing in inference is that throughput literally translates into real dollar value and driving down the cost." [2]

Sully.ai cut healthcare AI inference costs by 90% -- representing a 10x reduction -- while improving response times by 65% for critical workflows like generating medical notes [1]. The company develops AI employees that handle routine tasks like medical coding and note-taking, returning time to physicians previously lost to data entry. Sully.ai switched from proprietary models to open source models running on Baseten's Model API, which deploys models like gpt-oss-120b on NVIDIA Blackwell GPUs [2]. Baseten used the low-precision NVFP4 data format, the TensorRT-LLM library and the NVIDIA Dynamo inference framework to deliver optimized inference. The platform has now returned over 30 million minutes to physicians, time previously consumed by manual tasks [1].

Latitude reduced gaming inference costs by 4x for its AI Dungeon platform and upcoming Voyage role-playing game by running large Mixture-of-Experts models on DeepInfra's Blackwell-powered infrastructure [2]. Cost per million tokens dropped from 20 cents on NVIDIA Hopper to 10 cents on Blackwell, then to just 5 cents after adopting Blackwell's native NVFP4 low-precision format [1]. Hardware improvements alone delivered 2x gains, but reaching 4x required the precision format change [2]. Every player action in Latitude's platform triggers an inference request, making cost savings essential as engagement scales.

Decagon achieved a 6x cost reduction per query for AI-powered voice customer support by running its multimodel stack on Together AI's Blackwell infrastructure, maintaining response times under 400 milliseconds even when processing thousands of tokens per query [2]. Low latency proves critical for voice interactions where delays cause users to disconnect or lose trust. Sentient Foundation reported 25% to 50% better cost efficiency for agentic chat platforms using Fireworks AI's Blackwell-optimized inference stack, processing 5.6 million queries in a single week during its viral launch [2].
NVIDIA's extreme co-design approach combines hardware architecture with software optimization to achieve infrastructure efficiency gains [3]. The GB200 NVL72 configuration uses 72 chips coupled with 30TB of fast shared memory to optimize expert parallelism in Mixture-of-Experts architectures, ensuring token batches split and scatter across GPUs efficiently [3]. MoE models activate different specialized sub-models based on input, benefiting from Blackwell's NVLink fabric that enables rapid communication between experts [2].
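The routing idea behind MoE models can be sketched in a few lines: a gating network scores all experts for each token, only the top-k experts actually run, and their outputs are combined with the normalized gate weights. The sizes and random weights below are arbitrary placeholders; real MoE layers shard the experts across GPUs, which is why the interconnect between them matters so much.

```python
import numpy as np

D_MODEL, N_EXPERTS, TOP_K = 64, 8, 2
rng = np.random.default_rng(0)
gate_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1          # gating network
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.1           # expert weights
           for _ in range(N_EXPERTS)]

def moe_layer(tokens):
    """tokens: (n_tokens, D_MODEL) -> (n_tokens, D_MODEL)"""
    logits = tokens @ gate_w                       # gating scores per expert
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        chosen = np.argsort(logits[i])[-TOP_K:]    # indices of the top-k experts
        gates = np.exp(logits[i, chosen])
        gates /= gates.sum()                       # softmax over the chosen experts
        for gate, e in zip(gates, chosen):         # only TOP_K of N_EXPERTS run
            out[i] += gate * (tok @ experts[e])
    return out

print(moe_layer(rng.standard_normal((4, D_MODEL))).shape)  # (4, 64)
```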
Precision formats show the clearest impact on cost savings. NVFP4 reduces the number of bits required to represent model weights and activations, allowing more computation per GPU cycle while maintaining accuracy [2]. The software stack integration with tools like TensorRT-LLM and Dynamo creates additional performance improvements beyond hardware alone. Recent MIT research found that infrastructure and algorithmic efficiencies are reducing inference costs for frontier-level performance by up to 10x annually [1]. Businesses now face counterintuitive economics: reducing AI inference costs requires investing in higher-performance infrastructure because throughput improvements translate directly into lower per-token costs. As enterprises scale AI from pilot projects to millions of users, these cost reductions determine which applications remain economically viable.
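A generic sketch of the blockwise low-precision idea described above: weights are grouped into small blocks, each block stores one scale, and the values themselves are rounded to a 4-bit signed range. This illustrates the general principle only; it is not NVIDIA's actual NVFP4 encoding, which uses its own floating-point element and scale formats.

```python
import numpy as np

BLOCK = 16      # values that share one scale
LEVELS = 7      # signed 4-bit range used here: -7 .. +7

def quantize_blockwise(weights):
    blocks = weights.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / LEVELS   # per-block scale
    scales[scales == 0] = 1.0
    q = np.clip(np.round(blocks / scales), -LEVELS, LEVELS).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales):
    return (q * scales).reshape(-1)

weights = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, scales = quantize_blockwise(weights)
recovered = dequantize_blockwise(q, scales)

print("mean absolute error:", float(np.abs(weights - recovered).mean()))
print("approx. bits per weight:", (4 * weights.size + 32 * scales.size) / weights.size)
```

Even this crude sketch lands near 6 bits per weight instead of 32, which is the basic mechanism by which low-precision formats let each GPU cycle move and compute more of the model.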