2 Sources
[1]
How NVIDIA's Inference Software Stack Powers the Lowest Token Cost
Baseten, Cognition, Deep Infra, Together AI and Cursor are seeing compounding value from NVIDIA's software and open source ecosystem.
As organizations move from AI pilots to production AI factories, infrastructure decisions have shifted from peak chip specifications to cost per token: how many useful tokens they can deliver per dollar, per watt and within required latency targets. Codesigned with NVIDIA GPUs, CPUs, networking and systems, and strengthened by a broad open source ecosystem, NVIDIA's full-stack inference software continuously improves hardware performance. On the NVIDIA Blackwell platform, the software stack has already reduced token costs by up to 5x on the DeepSeek V4 model in just one month.
Leading companies and inference providers are already seeing the compounding value of NVIDIA's inference software stack on Blackwell:
* Baseten used the NVIDIA TensorRT-LLM open source library to serve DeepSeek V4 Pro on Blackwell GPUs for reasoning, coding and long-context workloads, applying proprietary runtime optimizations to deliver up to 50% more tokens per second.
* Cognition is using the NVIDIA Dynamo inference framework to manage inference GPUs, giving its team a ready-made path to scale reinforcement learning workloads without needing to build that infrastructure from scratch.
* Deep Infra uses the NVIDIA inference software stack to serve frontier open source models performantly on Blackwell from day zero, including DeepSeek V4.
* Together AI used NVIDIA TensorRT-LLM on Blackwell to help Cursor accelerate the path from model optimizations to production endpoints for its real-time coding experience.
Why Software Matters for Inference Economics
Traditional web, search and software-as-a-service workloads were relatively predictable: A user might load a page, refresh a feed or update a business record. These requests typically followed similar software paths, reading from or writing to a database, and scaled by adding more of the same servers.
Agentic AI is different. Agents can reason, plan, call tools, spin up specialist subagents and manage massive context across multi-turn workflows. They turn a single request into a distributed computing problem that can span hundreds of subagents, thousands of tasks and multiple large language models, running across GPUs, CPUs, DPUs and storage systems. The software stack determines whether that complexity turns into wasted capacity or lower cost per token.
Lower cost per token comes from turning individual optimizations into system-level performance. NVIDIA's inference software stack does this by connecting three layers:
* Production Operation: Coordinates distributed serving, orchestration, autoscaling and memory management so inference can run across the right compute and storage resources.
* Application Acceleration: Runs models with high performance while giving developers room to tune and customize, using runtime optimizations such as overlapping compute and communication and kernel fusion.
* Infrastructure Access: Exposes NVIDIA GPU, networking, memory and system capabilities without requiring developers to manage every device instruction set or data-transfer protocol directly.
When these layers work as one system, individual optimizations compound. Disaggregated serving, large expert parallelism over NVIDIA NVLink interconnect technology, NVFP4 precision and multi-token prediction each deliver meaningful gains on their own. Combined, they increase throughput by up to 20x. The chart below shows the result.
Capturing that gain in production is complex, requiring coordination across the full inference stack -- from production operations and model runtimes to kernels, communication libraries and hardware access. NVIDIA's inference software stack is designed to make those layers work together so each optimization can build on the others.
Open Source Amplifies the Full-Stack Advantage
That same full-stack foundation is amplified by the open source ecosystem.
Many of today's most widely used open source AI frameworks and inference projects are built natively on NVIDIA CUDA, which means new research and software optimizations run with leading performance on NVIDIA GPUs from day zero.
PyTorch is a leading example. Launched in 2016 with native CUDA support, PyTorch has coevolved with NVIDIA's architecture, giving developers access to innovations such as Tensor Cores, Transformer Engine and NVFP4 directly through a familiar framework. When breakthroughs such as DFlash speculative decode, which delivers up to 15x more throughput on existing hardware, or FastVideo, which generates 1080p videos in less than five seconds, land in PyTorch, they can run instantly on NVIDIA, helping AI factories convert research progress into lower token costs.
The same open source momentum is why when a new frontier open model like DeepSeek V4 is released, leading inference frameworks like vLLM and SGLang have day-zero deployment recipes for the NVIDIA Blackwell architecture -- making the model accessible across millions of Blackwell GPUs. It's also why DeepSeek V4 performance on Blackwell improved by up to 5x within about a month across vLLM and SGLang frameworks, cutting token costs to roughly one-fifth of previous levels.
That's the open source flywheel: more developers optimize CUDA-native inference paths, more production deployments feed back into the ecosystem and each software improvement increases delivered token output while lowering cost per token over time.
Explore how software multiplies hardware performance in this NVIDIA AI Podcast on tokenomics and this inference solutions page.
[2]
NVIDIA Slashes DeepSeek v4 Token Costs By Up To 5x Just One Month After Launch, Through Pure Blackwell Software Tuning
NVIDIA Blackwell GPUs continue to see massive optimizations, leading to an up to 5x drop in token cost on DeepSeek V4 AI models.
NVIDIA Cost Per Token Narrative Sees Massive Gain In DeepSeek V4 As AI Model Sees 5x Boost On Blackwell GPUs With Continued Optimizations
"Cost Per Token" is the fundamental metric for AI TCO, as NVIDIA highlighted a few months back, and the company is now delivering its lowest-ever token cost on DeepSeek V4. Today, NVIDIA announced that its full-stack inference software has brought further optimizations to its hardware, such as Blackwell GB200 and GB300, improving their performance. With the latest optimizations, NVIDIA's Blackwell platform has been able to reduce token costs by up to 5x on DeepSeek V4, just one month after the model's release.
Leading companies and inference providers have already acknowledged these gains on their NVIDIA Blackwell-powered platforms:
* Baseten used the NVIDIA TensorRT-LLM open source library to serve DeepSeek V4 Pro on Blackwell GPUs for reasoning, coding and long-context workloads, applying proprietary runtime optimizations to deliver up to 50% more tokens per second.
* Cognition uses the NVIDIA Dynamo inference framework to manage inference GPUs, giving its team a ready-made path to scale reinforcement learning workloads without needing to build that infrastructure from scratch.
* Deep Infra uses the NVIDIA inference software stack to serve frontier open-source models performantly on Blackwell from day zero, including DeepSeek V4.
* Together AI used NVIDIA TensorRT-LLM on Blackwell to help Cursor accelerate the path from model optimizations to production endpoints for its real-time coding experience.
The lower token costs come from turning individual optimizations into system-level performance on NVIDIA GPUs.
NVIDIA explains that its inference software stack achieves these gains by connecting three layers:
* Production Operation: Coordinates distributed serving, orchestration, autoscaling and memory management so inference can run across the right compute and storage resources.
* Application Acceleration: Runs models with high performance while giving developers room to tune and customize, using runtime optimizations such as overlapping compute and communication and kernel fusion.
* Infrastructure Access: Exposes NVIDIA GPU, networking, memory, and system capabilities without requiring developers to manage every device instruction set or data-transfer protocol directly.
When these layers are assembled into a complete system, the individual optimizations compound. Technologies such as NVLink, NVFP4 and multi-token prediction each offer meaningful gains on their own, and combined they deliver up to a 20x throughput increase.
NVIDIA's Blackwell GPUs, powered by continuous full-stack inference optimizations, have achieved an up to 5x reduction in cost per token for DeepSeek V4 just one month after its release, reinforcing cost per token as the key metric for AI total cost of ownership. Through integration of production operations, application acceleration and infrastructure access, along with technologies like NVLink and NVFP4, Blackwell delivers compounded system-level gains, resulting in up to 20x higher throughput. Leading inference providers, including Baseten, Cognition, Deep Infra, and Together AI, are already leveraging these advancements to deliver superior performance for reasoning, coding, and large-scale workloads, further solidifying NVIDIA's dominance in efficient AI inference.
NVIDIA has reduced token costs for DeepSeek V4 on its Blackwell platform by up to 5x within just one month of the model's release. Leading AI companies including Baseten, Cognition, Deep Infra, and Together AI are already leveraging these full-stack inference software improvements to deliver superior performance across reasoning, coding, and large-scale workloads.
NVIDIA has cut token costs for the DeepSeek V4 model on its Blackwell platform by up to 5x in just one month through continuous full-stack inference software improvements [1][2]. The result highlights how software optimization has become as critical as hardware specifications in determining AI total cost of ownership, shifting infrastructure decisions from peak chip performance to cost per token: how many useful tokens are delivered per dollar and per watt.
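The cost-per-token arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not NVIDIA's methodology: the function names, the GPU hourly price, the baseline throughput and the power draw are all hypothetical values chosen to keep the numbers round. Only the 5x factor comes from the reporting above.

```python
def cost_per_million_tokens(gpu_hour_cost_usd: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens on one GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_cost_usd / tokens_per_hour * 1_000_000


def tokens_per_joule(tokens_per_second: float, power_watts: float) -> float:
    """Energy efficiency: tokens produced per joule (one watt-second)."""
    return tokens_per_second / power_watts


# Hypothetical inputs for illustration only; none are measured values.
GPU_HOUR_COST_USD = 3.60   # assumed rental price per GPU-hour
BASELINE_TPS = 1_000.0     # assumed tokens per second at model launch
POWER_WATTS = 1_000.0      # assumed power draw per GPU
SPEEDUP = 5.0              # the "up to 5x" software gain reported above

baseline_cost = cost_per_million_tokens(GPU_HOUR_COST_USD, BASELINE_TPS)
tuned_cost = cost_per_million_tokens(GPU_HOUR_COST_USD, BASELINE_TPS * SPEEDUP)

print(f"baseline: ${baseline_cost:.2f} per million tokens")
print(f"tuned:    ${tuned_cost:.2f} per million tokens")
print(f"tokens per joule, tuned: {tokens_per_joule(BASELINE_TPS * SPEEDUP, POWER_WATTS):.1f}")
```

Because the hardware price and power draw stay fixed, a 5x throughput gain from software alone cuts the cost per token to one-fifth and raises tokens per watt by the same factor.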
Source: Wccftech
Leading inference providers are already seeing compounding value from these optimizations. Baseten used the NVIDIA TensorRT-LLM open source library to serve DeepSeek V4 Pro on Blackwell GPUs, applying proprietary runtime optimizations to deliver up to 50% more tokens per second for reasoning, coding, and long-context workloads [1]. Together AI leveraged TensorRT-LLM on Blackwell to help Cursor accelerate the path from model optimizations to production endpoints for real-time coding experiences.
The token cost reduction stems from NVIDIA's three-layer architecture that connects production operations, application acceleration, and infrastructure access into a unified system [2]. Production operations coordinate distributed serving, orchestration, autoscaling, and memory management across compute and storage resources. Application acceleration runs models with high performance while giving developers room to customize using runtime optimizations like overlapping compute and communication and kernel fusion. Infrastructure access exposes GPU, networking, memory, and system capabilities without requiring developers to manage device instruction sets directly.
Source: NVIDIA
When these layers work together, individual optimizations compound dramatically. Technologies like disaggregated serving, large expert parallelism over NVLink interconnect, NVFP4 precision, and multi-token prediction each deliver meaningful gains independently, but combined they increase throughput by up to 20x [1][2]. This matters because agentic AI inference workloads differ fundamentally from traditional software-as-a-service applications, turning single requests into distributed computing problems spanning hundreds of subagents and thousands of tasks.
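A rough way to see why stacked optimizations compound rather than merely add up is to treat each one as an independent multiplier on throughput. The sketch below is an illustration under that assumption. The per-technique factors are hypothetical, picked only so their product matches the reported "up to 20x" figure. Neither source publishes a per-technique breakdown.

```python
from math import prod

# Hypothetical per-technique speedups, NOT a published breakdown.
# They are chosen only so the product equals the reported 20x.
speedups = {
    "disaggregated_serving": 2.0,
    "large_expert_parallelism": 2.5,
    "nvfp4_precision": 2.0,
    "multi_token_prediction": 2.0,
}


def compounded_speedup(factors) -> float:
    """Total speedup if every optimization multiplies the others."""
    return prod(factors)


def additive_speedup(factors) -> float:
    """Total speedup if each gain only added to an unchanged baseline."""
    return 1 + sum(f - 1 for f in factors)


compounded = compounded_speedup(speedups.values())
additive = additive_speedup(speedups.values())

print(f"compounded: {compounded:.1f}x")
print(f"additive:   {additive:.1f}x")
```

Under these assumed factors, the same four techniques yield 20x when they multiply but only 5.5x if their gains were simply summed, which is why coordination across the stack matters more than any single optimization.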
NVIDIA's full-stack advantage extends through its open source ecosystem built natively on CUDA. PyTorch, launched in 2016 with native CUDA support, has coevolved with NVIDIA architecture to give developers direct access to innovations like Tensor Cores, Transformer Engine, and NVFP4 [1]. When breakthroughs like DFlash speculative decode, which delivers up to 15x more throughput on existing hardware, land in PyTorch, they run instantly on NVIDIA GPUs, helping AI production environments convert research progress into lower operating costs.
Cognition is using the NVIDIA Dynamo inference framework to manage inference GPUs, providing a ready-made path to scale reinforcement learning workloads without building infrastructure from scratch [2]. Deep Infra uses the NVIDIA inference software stack to serve frontier open source models performantly on Blackwell from day zero, including DeepSeek V4. The GB200 and GB300 systems continue to see massive optimizations that compound over time, suggesting organizations should watch for further cost per token improvements as the software stack matures and new optimization techniques emerge from the research community.
Summarized by Navi