6 Sources
[1]
New Data Shows NVIDIA Blackwell Ultra Delivers up to 50x Better Performance and 35x Lower Costs for Agentic AI
Cloud providers including Microsoft, CoreWeave and Oracle Cloud Infrastructure are deploying NVIDIA GB300 NVL72 systems at scale for low-latency and long-context use cases such as agentic coding and coding assistants. The NVIDIA Blackwell platform has been widely adopted by leading inference providers such as Baseten, DeepInfra, Fireworks AI and Together AI to reduce cost per token by up to 10x. Now, the NVIDIA Blackwell Ultra platform is taking this momentum further for agentic AI.

AI agents and coding assistants are driving explosive growth in software-programming-related AI queries: from 11% to about 50% last year, according to OpenRouter's State of Inference report. These applications require low latency to maintain real-time responsiveness across multistep workflows, and long context when reasoning across entire codebases. New performance data shows that the combination of NVIDIA's software optimizations and the next-generation NVIDIA Blackwell Ultra platform has delivered breakthrough advances on both fronts.

NVIDIA GB300 NVL72 systems now deliver up to 50x higher throughput per megawatt, resulting in 35x lower cost per token compared with the NVIDIA Hopper platform. By innovating across chips, system architecture and software, NVIDIA's extreme codesign accelerates performance across AI workloads -- from agentic coding to interactive coding assistants -- while driving down costs at scale.

GB300 NVL72 Delivers up to 50x Better Performance for Low-Latency Workloads

Recent analysis from Signal65 shows that NVIDIA GB200 NVL72, with extreme hardware and software codesign, delivers more than 10x more tokens per watt, resulting in one-tenth the cost per token compared with the NVIDIA Hopper platform. These massive performance gains continue to expand as the underlying stack improves.
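The link between throughput per megawatt and cost per token is simple arithmetic: at a fixed energy price, producing more tokens from the same megawatt makes each token cheaper. The sketch below illustrates this with invented numbers (energy is only one cost component in practice, which is why the reported cost reduction of 35x need not match the 50x throughput gain exactly):

```python
# Back-of-the-envelope sketch of why throughput per megawatt drives cost per
# token. All numbers here are illustrative placeholders, not NVIDIA figures.

def cost_per_million_tokens(tokens_per_sec_per_mw: float,
                            dollars_per_mwh: float) -> float:
    """Energy cost to produce one million tokens at a given rack efficiency."""
    tokens_per_hour_per_mw = tokens_per_sec_per_mw * 3600
    mwh_per_million_tokens = 1e6 / tokens_per_hour_per_mw  # MWh consumed
    return mwh_per_million_tokens * dollars_per_mwh

baseline = cost_per_million_tokens(tokens_per_sec_per_mw=1_000,
                                   dollars_per_mwh=100.0)
improved = cost_per_million_tokens(tokens_per_sec_per_mw=50_000,
                                   dollars_per_mwh=100.0)

print(f"baseline rack:       ${baseline:.2f} per 1M tokens")
print(f"50x-throughput rack: ${improved:.2f} per 1M tokens")
print(f"energy cost ratio: {baseline / improved:.0f}x lower")
```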
Continuous optimizations from the NVIDIA TensorRT-LLM, NVIDIA Dynamo, Mooncake and SGLang teams continue to significantly boost Blackwell NVL72 throughput for mixture-of-experts (MoE) inference across all latency targets. For instance, NVIDIA TensorRT-LLM library improvements have delivered up to 5x better performance on GB200 for low-latency workloads compared with just four months ago.

* Higher-performance GPU kernels optimized for efficiency and low latency help make the most of Blackwell's immense compute capabilities and boost throughput.
* NVIDIA NVLink Symmetric Memory enables direct GPU-to-GPU memory access for more efficient communication.
* Programmatic dependent launch minimizes idle time by launching the next kernel's setup phase before the previous one completes.

Building on these software advances, GB300 NVL72 -- which features the Blackwell Ultra GPU -- pushes the throughput-per-megawatt frontier to 50x compared with the Hopper platform. This performance gain translates into superior economics, with NVIDIA GB300 lowering costs compared with the Hopper platform across the entire latency spectrum. The most dramatic reduction occurs at low latency, where agentic applications operate: up to 35x lower cost per million tokens compared with the Hopper platform.

For agentic coding and interactive assistant workloads, where every millisecond compounds across multistep workflows, this combination of relentless software optimization and next-generation hardware enables AI platforms to scale real-time interactive experiences to significantly more users.

GB300 NVL72 Delivers Superior Economics for Long-Context Workloads

While both GB200 NVL72 and GB300 NVL72 efficiently deliver ultralow latency, the distinct advantages of GB300 NVL72 become most apparent in long-context scenarios.
For workloads with 128,000-token inputs and 8,000-token outputs -- such as AI coding assistants reasoning across codebases -- GB300 NVL72 delivers up to 1.5x lower cost per token compared with GB200 NVL72. Context grows as the agent reads in more of the code: this allows it to better understand the codebase, but it also requires much more compute. Blackwell Ultra has 1.5x higher NVFP4 compute performance and 2x faster attention processing, enabling the agent to efficiently understand entire codebases.

Infrastructure for Agentic AI

Leading cloud providers and AI innovators have already deployed NVIDIA GB200 NVL72 at scale, and are also deploying GB300 NVL72 in production. Microsoft, CoreWeave and OCI are deploying GB300 NVL72 for low-latency and long-context use cases such as agentic coding and coding assistants. By reducing token costs, GB300 NVL72 enables a new class of applications that can reason across massive codebases in real time.

"As inference moves to the center of AI production, long-context performance and token efficiency become critical," said Chen Goldberg, senior vice president of engineering at CoreWeave. "Grace Blackwell NVL72 addresses that challenge directly, and CoreWeave's AI cloud, including CKS and SUNK, is designed to translate GB300 systems' gains, building on the success of GB200, into predictable performance and cost efficiency. The result is better token economics and more usable inference for customers running workloads at scale."

NVIDIA Vera Rubin NVL72 to Bring Next-Generation Performance

With NVIDIA Blackwell systems deployed at scale, continuous software optimizations will keep unlocking additional performance and cost improvements across the installed base. Looking ahead, the NVIDIA Rubin platform -- which combines six new chips to create one AI supercomputer -- is set to deliver another round of massive performance leaps.
For MoE inference, it delivers up to 10x higher throughput per megawatt compared with Blackwell, translating into one-tenth the cost per million tokens. And for the next wave of frontier AI models, Rubin can train large MoE models using just one-fourth the number of GPUs compared with Blackwell. Learn more about the NVIDIA Rubin platform and the Vera Rubin NVL72 system.
[2]
Leading Inference Providers Cut AI Costs by up to 10x With Open Source Models on NVIDIA Blackwell
Baseten, DeepInfra, Fireworks AI and Together AI are reducing cost per token across industries with optimized inference stacks running on the NVIDIA Blackwell platform.

A diagnostic insight in healthcare. A character's dialogue in an interactive game. An autonomous resolution from a customer service agent. Each of these AI-powered interactions is built on the same unit of intelligence: a token. Scaling these AI interactions requires businesses to consider whether they can afford more tokens. The answer lies in better tokenomics -- which at its core is about driving down the cost of each token. This downward trend is unfolding across industries. Recent MIT research found that infrastructure and algorithmic efficiencies are reducing inference costs for frontier-level performance by up to 10x annually.

To understand how infrastructure efficiency improves tokenomics, consider the analogy of a high-speed printing press. If the press produces 10x output with incremental investment in ink, energy and the machine itself, the cost to print each individual page drops. In the same way, investments in AI infrastructure can lead to far greater token output compared with the increase in cost -- causing a meaningful reduction in the cost per token.

That's why leading inference providers including Baseten, DeepInfra, Fireworks AI and Together AI are using the NVIDIA Blackwell platform, which helps them reduce cost per token by up to 10x compared with the NVIDIA Hopper platform. These providers host advanced open source models, which have now reached frontier-level intelligence. By combining open source frontier intelligence, the extreme hardware-software codesign of NVIDIA Blackwell and their own optimized inference stacks, these providers are enabling dramatic token cost reductions for businesses across every industry.
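The printing-press analogy reduces to a one-line formula: cost per token is total spend divided by tokens produced. A quick sketch with invented figures shows how output growing faster than cost drives per-token cost down:

```python
# The printing-press analogy above, in numbers. All figures are invented;
# the point is that when output grows faster than cost, per-unit cost falls.

def cost_per_token(total_cost_dollars: float, tokens: int) -> float:
    return total_cost_dollars / tokens

baseline = cost_per_token(100.0, 1_000_000)   # old press: $100 for 1M tokens
upgraded = cost_per_token(130.0, 10_000_000)  # 10x output for 1.3x the spend

print(f"baseline: ${baseline * 1e6:.2f} per 1M tokens")
print(f"upgraded: ${upgraded * 1e6:.2f} per 1M tokens")
print(f"improvement: {baseline / upgraded:.1f}x cheaper per token")
```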
Healthcare -- Baseten and Sully.ai Cut AI Inference Costs by 10x

In healthcare, tedious, time-consuming tasks like medical coding, documentation and managing insurance forms cut into the time doctors can spend with patients. Sully.ai helps solve this problem by developing "AI employees" that can handle routine tasks like medical coding and note-taking. As the company's platform scaled, its proprietary, closed source models created three bottlenecks: unpredictable latency in real-time clinical workflows, inference costs that scaled faster than revenue and insufficient control over model quality and updates.

To overcome these bottlenecks, Sully.ai uses Baseten's Model API, which deploys open source models such as gpt-oss-120b on NVIDIA Blackwell GPUs. Baseten used the low-precision NVFP4 data format, the NVIDIA TensorRT-LLM library and the NVIDIA Dynamo inference framework to deliver optimized inference. The company chose NVIDIA Blackwell to run its Model API after seeing up to 2.5x better throughput per dollar compared with the NVIDIA Hopper platform.

As a result, Sully.ai's inference costs dropped by 90%, representing a 10x reduction compared with the prior closed source implementation, while response times improved by 65% for critical workflows like generating medical notes. The company has now returned over 30 million minutes to physicians, time previously lost to data entry and other manual tasks.

Gaming -- DeepInfra and Latitude Reduce Cost per Token by 4x

Latitude is building the future of AI-native gaming with its AI Dungeon adventure-story game and upcoming AI-powered role-playing gaming platform, Voyage, where players can create or play worlds with the freedom to choose any action and make their own story. The company's platform uses large language models to respond to players' actions -- but this comes with scaling challenges, as every player action triggers an inference request.
Costs scale with engagement, and response times must stay fast enough to keep the experience seamless. Latitude runs large open source models on DeepInfra's inference platform, powered by NVIDIA Blackwell GPUs and TensorRT-LLM. For a large-scale mixture-of-experts (MoE) model, DeepInfra reduced the cost per million tokens from 20 cents on the NVIDIA Hopper platform to 10 cents on Blackwell. Moving to Blackwell's native low-precision NVFP4 format further cut that cost to just 5 cents -- for a total 4x improvement in cost per token -- while maintaining the accuracy that customers expect.

Running these large-scale MoE models on DeepInfra's Blackwell-powered platform allows Latitude to deliver fast, reliable responses cost effectively. DeepInfra's inference platform delivers this performance while reliably handling traffic spikes, letting Latitude deploy more capable models without compromising player experience.

Agentic Chat -- Fireworks AI and Sentient Foundation Lower AI Costs by up to 50%

Sentient Labs is focused on bringing AI developers together to build powerful reasoning AI systems that are all open source. The goal is to accelerate AI toward solving harder reasoning problems through research in secure autonomy, agentic architecture and continual learning. Its first app, Sentient Chat, orchestrates complex multi-agent workflows and integrates more than a dozen specialized AI agents from the community. As a result, Sentient Chat has massive compute demands: a single user query can trigger a cascade of autonomous interactions that typically leads to costly infrastructure overhead.

To manage this scale and complexity, Sentient uses Fireworks AI's inference platform running on NVIDIA Blackwell. With Fireworks' Blackwell-optimized inference stack, Sentient achieved 25-50% better cost efficiency compared with its previous Hopper-based deployment. This higher throughput per GPU allowed the company to serve significantly more concurrent users for the same cost.
The platform's scalability supported a viral launch of 1.8 million waitlisted users in 24 hours and processed 5.6 million queries in a single week while delivering consistent low latency.

Customer Service -- Together AI and Decagon Drive Down Cost by 6x

Customer service calls with voice AI often end in frustration because even a slight delay can lead users to talk over the agent, hang up or lose trust. Decagon builds AI agents for enterprise customer support, with AI-powered voice being its most demanding channel. Decagon needed infrastructure that could deliver sub-second responses under unpredictable traffic loads, with tokenomics that supported 24/7 voice deployments.

Together AI runs production inference for Decagon's multimodel voice stack on NVIDIA Blackwell GPUs. The companies collaborated on several key optimizations: speculative decoding that trains smaller models to generate faster responses while a larger model verifies accuracy in the background, caching repeated conversation elements to speed up responses and building automatic scaling that handles traffic surges without degrading performance.

Decagon saw response times under 400 milliseconds even when processing thousands of tokens per query. Cost per query, which is the total cost to complete one voice interaction, dropped by 6x compared with using closed source proprietary models. This was achieved through the combination of Decagon's multimodel approach (some open source, some trained in house on NVIDIA GPUs), NVIDIA Blackwell's extreme codesign and Together's optimized inference stack.

Optimizing Tokenomics With Extreme Codesign

The dramatic cost savings seen across healthcare, gaming and customer service are driven by the efficiency of NVIDIA Blackwell. The NVIDIA GB200 NVL72 system further scales this impact by delivering a breakthrough 10x reduction in cost per token for reasoning MoE models compared with NVIDIA Hopper.
NVIDIA's extreme codesign across every layer of the stack -- spanning compute, networking and software -- and its partner ecosystem are unlocking massive reductions in cost per token at scale. This momentum continues with the NVIDIA Rubin platform -- integrating six new chips into a single AI supercomputer to deliver 10x performance and 10x lower token cost over Blackwell. Explore NVIDIA's full-stack inference platform to learn more about how it delivers better tokenomics for AI inference.
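The speculative decoding technique described for Decagon's voice stack can be sketched in a few lines. A small, fast "draft" model proposes several tokens and a larger "target" model checks them, keeping the agreed prefix. Both models below are deterministic stand-ins, and the target is consulted once per token for clarity; real systems verify the whole draft in one batched forward pass, which is where the speedup comes from:

```python
# Toy sketch of speculative decoding with stand-in "models". Neither function
# is a real LLM; they only make the accept/reject mechanics visible.

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]

def draft_next(prefix, k=4):
    # Cheap guesser: proposes the next k tokens from position alone.
    return [VOCAB[(len(prefix) + i) % len(VOCAB)] for i in range(k)]

def target_next(prefix):
    # Authoritative model: the single token it would emit next.
    return VOCAB[len(prefix) % len(VOCAB)]

def speculative_step(prefix, k=4):
    """Extend prefix using draft proposals verified by the target model."""
    accepted = list(prefix)
    for tok in draft_next(prefix, k):
        correct = target_next(accepted)
        accepted.append(correct)
        if correct != tok:  # first disagreement ends the speculation round
            break
    return accepted

print(speculative_step([]))  # -> ['the', 'cat', 'sat', 'on']
```

When the draft agrees with the target, several tokens land for roughly the price of one verification pass; when it disagrees, the output is still exactly what the target model would have produced on its own.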
[3]
AI inference costs dropped up to 10x on Nvidia's Blackwell -- but hardware is only half the equation
Lowering the cost of inference is typically a combination of hardware and software. A new analysis released Thursday by Nvidia details how four leading inference providers are reporting 4x to 10x reductions in cost per token. The dramatic cost reductions were achieved using Nvidia's Blackwell platform with open-source models. Production deployment data from Baseten, DeepInfra, Fireworks AI and Together AI shows significant cost improvements across healthcare, gaming, agentic chat and customer service as enterprises scale AI from pilot projects to millions of users.

The 4x to 10x cost reductions reported by inference providers required combining Blackwell hardware with two other elements: optimized software stacks and switching from proprietary to open-source models that now match frontier-level intelligence. Hardware improvements alone delivered 2x gains in some deployments, according to the analysis. Reaching larger cost reductions required adopting low-precision formats like NVFP4 and moving away from closed source APIs that charge premium rates.

The economics are counterintuitive: reducing inference costs requires investing in higher-performance infrastructure, because throughput improvements translate directly into lower per-token costs. "Performance is what drives down the cost of inference," Dion Harris, senior director of HPC and AI hyperscaler solutions at Nvidia, told VentureBeat in an exclusive interview. "What we're seeing in inference is that throughput literally translates into real dollar value and driving down the cost."

Production deployments show 4x to 10x cost reductions

Nvidia detailed four customer deployments in a blog post showing how the combination of Blackwell infrastructure, optimized software stacks and open-source models delivers cost reductions across different industry workloads. The case studies span high-volume applications where inference economics directly determines business viability.
Sully.ai cut healthcare AI inference costs by 90% (a 10x reduction) while improving response times 65% by switching from proprietary models to open-source models running on Baseten's Blackwell-powered platform, according to Nvidia. The company returned over 30 million minutes to physicians by automating medical coding and note-taking tasks that previously required manual data entry.

Nvidia also reported that Latitude reduced gaming inference costs 4x for its AI Dungeon platform by running large mixture-of-experts (MoE) models on DeepInfra's Blackwell deployment. Cost per million tokens dropped from 20 cents on Nvidia's previous Hopper platform to 10 cents on Blackwell, then to 5 cents after adopting Blackwell's native NVFP4 low-precision format. Hardware alone delivered 2x improvement, but reaching 4x required the precision format change.

Sentient Foundation achieved 25% to 50% better cost efficiency for its agentic chat platform using Fireworks AI's Blackwell-optimized inference stack, according to Nvidia. The platform orchestrates complex multi-agent workflows and processed 5.6 million queries in a single week during its viral launch while maintaining low latency.

Nvidia said Decagon saw 6x cost reduction per query for AI-powered voice customer support by running its multimodel stack on Together AI's Blackwell infrastructure. Response times stayed under 400 milliseconds, even when processing thousands of tokens per query -- critical for voice interactions where delays cause users to hang up or lose trust.

Technical factors driving 4x versus 10x improvements

The range from 4x to 10x cost reductions across deployments reflects different combinations of technical optimizations rather than just hardware differences. Three factors emerge as primary drivers: precision format adoption, model architecture choices and software stack integration. Precision formats show the clearest impact, and Latitude's case demonstrates this directly.
Moving from Hopper to Blackwell delivered 2x cost reduction through hardware improvements. Adopting NVFP4, Blackwell's native low-precision format, doubled that improvement to 4x total. NVFP4 reduces the number of bits required to represent model weights and activations, allowing more computation per GPU cycle while maintaining accuracy. The format works particularly well for MoE models, where only a subset of the model activates for each inference request.

Model architecture matters. MoE models, which activate different specialized sub-models based on input, benefit from Blackwell's NVLink fabric, which enables rapid communication between experts. "Having those experts communicate across that NVLink fabric allows you to reason very quickly," Harris said. Dense models that activate all parameters for every inference don't leverage this architecture as effectively.

Software stack integration creates additional performance deltas. Harris said that Nvidia's co-design approach -- where Blackwell hardware, NVL72 scale-up architecture, and software like Dynamo and TensorRT-LLM are optimized together -- also makes a difference. Baseten's deployment for Sully.ai used this integrated stack, combining NVFP4, TensorRT-LLM and Dynamo to achieve the 10x cost reduction. Providers running alternative frameworks like vLLM may see lower gains.

Workload characteristics matter. Reasoning models show particular advantages on Blackwell because they generate significantly more tokens to reach better answers. The platform's ability to process these extended token sequences efficiently through disaggregated serving, where context prefill and token generation are handled separately, makes reasoning workloads cost-effective.

Teams evaluating potential cost reductions should examine their workload profiles against these factors. High token generation workloads using mixture-of-experts models with the integrated Blackwell software stack will approach the 10x range.
Lower token volumes using dense models on alternative frameworks will land closer to 4x.

What teams should test before migrating

While these case studies focus on Nvidia Blackwell deployments, enterprises have multiple paths to reducing inference costs. AMD's MI300 series, Google TPUs, and specialized inference accelerators from Groq and Cerebras offer alternative architectures. Cloud providers also continue optimizing their inference services. The question isn't whether Blackwell is the only option but whether the specific combination of hardware, software and models fits particular workload requirements.

Enterprises considering Blackwell-based inference should start by calculating whether their workloads justify infrastructure changes. "Enterprises need to work back from their workloads and use case and cost constraints," Shruti Koparkar, AI product marketing at Nvidia, told VentureBeat. The deployments achieving 6x to 10x improvements all involved high-volume, latency-sensitive applications processing millions of requests monthly. Teams running lower volumes or applications with latency budgets exceeding one second should explore software optimization or model switching before considering infrastructure upgrades.

Testing matters more than provider specifications. Koparkar emphasized that providers publish throughput and latency metrics, but these represent ideal conditions. "If it's a highly latency-sensitive workload, they might want to test a couple of providers and see who meets the minimum they need while keeping the cost down," she said. Teams should run actual production workloads across multiple Blackwell providers to measure real performance under their specific usage patterns and traffic spikes rather than relying on published benchmarks.

The staged approach Latitude used provides a model for evaluation. The company first moved to Blackwell hardware and measured 2x improvement, then adopted NVFP4 format to reach 4x total reduction.
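The NVFP4 step in that staged approach rests on block-scaled 4-bit quantization. Below is a toy sketch in that spirit; real NVFP4 packs E2M1 values with hardware FP8 scale factors per 16-element block, while this pure-Python version only illustrates the idea of a shared scale per block with each value snapped to the nearest representable 4-bit float:

```python
# Toy sketch of block-scaled 4-bit quantization in the spirit of NVFP4.
# Not the real format: just the core mechanism, in plain Python floats.

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative E2M1 magnitudes

def quantize_block(block, grid=E2M1):
    """Quantize one block of weights: shared scale, 4-bit value per weight."""
    scale = max(abs(x) for x in block) / grid[-1] or 1.0
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable grid point.
        mag = min(grid, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale * (1 if x >= 0 else -1))
    return out

weights = [0.91, -0.42, 0.07, 0.66, -1.20, 0.33, 0.05, -0.88]
print(quantize_block(weights))
```

Each stored value needs only 4 bits plus a shared per-block scale, which is why the format cuts memory traffic and lets the GPU do more arithmetic per cycle at a small, controlled loss of precision.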
Teams currently on Hopper or other infrastructure can test whether precision format changes and software optimization on existing hardware capture meaningful savings before committing to full infrastructure migrations. Running open source models on current infrastructure might deliver half the potential cost reduction without new hardware investments.

Provider selection requires understanding software stack differences. While multiple providers offer Blackwell infrastructure, their software implementations vary. Some run Nvidia's integrated stack using Dynamo and TensorRT-LLM, while others use frameworks like vLLM. Harris acknowledged that performance deltas exist between these configurations. Teams should evaluate what each provider actually runs and how it matches their workload requirements rather than assuming all Blackwell deployments perform identically.

The economic equation extends beyond cost per token. Specialized inference providers like Baseten, DeepInfra, Fireworks and Together offer optimized deployments but require managing additional vendor relationships. Managed services from AWS, Azure or Google Cloud may have higher per-token costs but lower operational complexity. Teams should calculate total cost including operational overhead, not just inference pricing, to determine which approach delivers better economics for their specific situation.
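That total-cost comparison can be made concrete with a small model: a specialized provider with cheaper tokens but more operational overhead versus a managed service with the opposite profile. Every number below is invented for the sketch:

```python
# Illustrative total-cost comparison for the tradeoff described above.
# All prices and volumes are made up; only the structure is the point.

def monthly_total_cost(tokens_per_month: float,
                       price_per_million_tokens: float,
                       ops_overhead_per_month: float) -> float:
    inference = tokens_per_month / 1e6 * price_per_million_tokens
    return inference + ops_overhead_per_month

tokens = 200e9  # a high-volume workload: 200B tokens/month
specialized = monthly_total_cost(tokens, 0.05, 15_000)  # cheap tokens, more ops
managed = monthly_total_cost(tokens, 0.20, 2_000)       # pricier tokens, less ops

print(f"specialized: ${specialized:,.0f}/month")
print(f"managed:     ${managed:,.0f}/month")
# Rerun with a small tokens value (e.g., 2e9) and the ranking flips:
# the operational overhead dominates before volume amortizes it.
```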
[4]
NVIDIA Blackwell Ultra delivers 50x higher efficiency for agentic AI
Nvidia's new benchmark data reveals that GB300 NVL72 systems equipped with Blackwell Ultra GPUs achieve up to 50x higher throughput per megawatt and 35x lower cost per token compared to the Hopper platform for low-latency AI workloads. The metrics reflect combined hardware and software advancements targeting agentic AI and coding assistant deployments. Performance gains derive from specific architectural changes and library optimizations that address transformer attention layer bottlenecks. These efficiency improvements reduce operational costs for cloud providers and inference services, enabling broader deployment of compute-intensive models.

Blackwell Ultra Tensor Cores provide 1.5x greater compute performance than standard Blackwell GPUs. The architecture doubles attention-layer processing via accelerated softmax execution, directly supporting reasoning models that utilize large context windows. Nvidia's TensorRT-LLM inference library has recorded sustained performance increases, with SemiAnalysis benchmarks documenting that throughput per GPU doubled at certain interactivity levels since October 2025. The company states that these developments deliver a 10x increase in tokens per second per user and a 5x improvement in tokens per second per megawatt relative to Hopper. Cumulatively, these factors produce the 50x rise in AI factory output.

Chen Goldberg, senior vice president of engineering at CoreWeave, emphasized the operational focus of these advancements. "As inference moves to the center of AI production, long-context performance and token efficiency become critical," Goldberg stated. "Grace Blackwell NVL72 addresses that challenge directly."

CoreWeave announced in 2025 that it was the first AI cloud provider to deploy GB300 NVL72 systems in production, integrating the hardware with its Kubernetes-based cloud stack. Microsoft subsequently deployed what it describes as the world's first large-scale GB300 NVL72 supercomputing cluster.
Testing validated by Signal65 recorded the cluster achieving over 1.1 million tokens per second on a single rack. Oracle's OCI platform is deploying GB300 NVL72 systems, with plans to scale Superclusters beyond 100,000 Blackwell GPUs to support inference workload demand.

Leading inference providers, including Baseten, DeepInfra, Fireworks AI and Together AI, reported up to 10x cost reductions using the standard Blackwell platform. The Blackwell Ultra platform extends these efficiencies to workloads requiring low latency, achieving a 35x lower cost per million tokens. This reduction facilitates the economically viable deployment of AI agents and coding assistants at scale.

Nvidia has previewed its next-generation Rubin platform, projecting a 10x performance improvement over Blackwell.
[5]
NVIDIA's Blackwell Ultra Pushes "Agentic AI" Performance to New Heights, Delivering Up to 50× Higher Tokens/Watt & Stronger Long-Context Workloads
NVIDIA's Blackwell Ultra is the modern-day computing option for hyperscalers, and in newer benchmarks, the GB300 NVL72 shows immense performance in low-latency and long-context workloads. The AI industry has evolved across multiple layers since its original boom back in 2022, and right now, we are seeing a major shift towards agentic computing, driven by applications/wrappers built on frontier models. At the same time, for infrastructure providers like NVIDIA, it has become increasingly important to have ample memory bandwidth and performance onboard to meet the latency requirements of agentic frameworks, and with Blackwell Ultra, Team Green has done just that. In a new blog post, NVIDIA tested Blackwell Ultra on SemiAnalysis's InferenceMAX, and the results are astonishing.

NVIDIA's first infographic emphasizes a figure called "tokens/watt," which is probably one of the world's most important numbers to look at with the current hyperscaler buildout. The company has focused on both raw performance and throughput optimizations, which is why, with GB300 NVL72, NVIDIA sees a 50x increase in throughput per megawatt compared to Hopper GPUs. The comparison below shows the best possible 'deployed state' for each architecture.

If you are curious about how the throughput-per-megawatt gains are so phenomenal, well, NVIDIA takes pride in its NVLink technology. Blackwell Ultra has expanded to a 72-GPU front, joining them into a single, unified NVLink fabric with 130 TB/s of connectivity. Compared to Hopper, which is confined to an 8-chip NVLink design, NVIDIA has brought in superior architecture, rack design and, more importantly, the NVFP4 precision format, which is why GB300 dominates in throughput.

Given the "agentic AI" wave, NVIDIA's GB300 NVL72 testing also focuses on token costs and on the upgrades mentioned above. Team Green sees a massive 35x reduction in cost per million tokens, making it the go-to inference option for frontier labs and hyperscalers.
Yet again, scaling laws remain intact and are evolving at a pace no one would've imagined, and the major catalysts for these performance upgrades are indeed the "extreme co-design" structure NVIDIA has in place, along with, of course, what we call Huang's Law. The comparison with Hopper becomes a bit unfair when you factor in the generational differences in compute nodes and architectures, which is why NVIDIA has also compared the GB200 with the GB300 (NVL72s) across long-context workloads. Context is indeed the next major constraint for agents, given that to maintain state over an entire codebase, token usage rises aggressively. With Blackwell Ultra, NVIDIA sees up to 1.5x lower cost per token and 2x faster attention processing, making it well-positioned for agentic workloads.

Given that Blackwell Ultra is currently in the process of hyperscaler integrations, these are among the first benchmarks of the architecture, and by the looks of it, NVIDIA has managed to keep performance scaling intact and aligned with modern-day AI use cases. And, with Vera Rubin, one could expect even better performance than the Blackwell generation, making it one of the many reasons why NVIDIA currently dominates the infrastructure race.
[6]
NVIDIA Has Managed to Reduce Token Costs by a Whopping 10x With Its Newest Blackwell Platform, Credited to Team Green's "Extreme Codesign" Approach
NVIDIA's Blackwell platform has brought new levels of token optimization to AI inference workloads, as the company reveals a massive milestone in the realm of tokenomics.

While NVIDIA has been racing to build new infrastructure in the AI world, one of the company's biggest focuses has been improving the efficiency of the hardware it deploys. With Blackwell-trained frontier AI models arriving across the industry, we have seen how NVIDIA has progressed with token output and costs, and now, in a new blog post, the company has revealed that it has been working with businesses to scale up Blackwell performance, reporting a significant ten-fold improvement over the Hopper generation.

That's why leading inference providers including Baseten, DeepInfra, Fireworks AI and Together AI are using the NVIDIA Blackwell platform, which helps them reduce cost per token by up to 10x compared with the NVIDIA Hopper platform. These providers host advanced open source models, which have now reached frontier-level intelligence. By combining open source frontier intelligence, the extreme hardware-software codesign of NVIDIA Blackwell and their own optimized inference stacks, these providers are enabling dramatic token cost reductions for businesses across every industry. - NVIDIA

While discussing tokenomics on Blackwell, NVIDIA highlighted organizations like Baseten and Sully.ai, along with the gaming-focused DeepInfra and Latitude. For each company, the Blackwell architecture has enabled lower latency, optimal inference costs and reliable responses, which is why the tech stack is the go-to option for mainstream AI companies today. Even in multi-agent workflows and deploying specialized AI agents, a company called Sentient Labs has achieved "25-50% better cost efficiency" relative to Hopper. NVIDIA's progress with the Blackwell AI architecture is driven by the company's "extreme co-design" approach, which is optimal for today's MoE architectures.
With GB200 NVL72, NVIDIA uses a 72-GPU configuration coupled with 30TB of fast shared memory to take expert parallelism to a new level, ensuring that token batches are constantly split and scattered across GPUs even as communication volume grows at a non-linear rate. This is one of the reasons tokenomics is shaping up to be among Blackwell's most efficient metrics yet. With Vera Rubin, Team Green plans to push infrastructure efficiency further still, driven by architectural advancements, specialized mechanisms such as CPX for prefill, and more. The world of AI is evolving at an overwhelming pace, which is why optimizing existing hardware is as important as developing new hardware.
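The expert-parallel scattering described above can be sketched at toy scale. In the sketch below, random routing stands in for a learned MoE gating network and the GPU count is shrunk from 72, so it is purely illustrative:

```python
import random

# Toy sketch of expert-parallel routing: each token is assigned an expert,
# and each expert lives on one GPU, so a batch is split and scattered
# across devices rather than processed whole on a single device.
NUM_EXPERTS = 8
NUM_GPUS = 4  # shrunk from 72 for illustration

def route(token_ids):
    """Group token ids by the GPU that hosts their assigned expert."""
    per_gpu = {g: [] for g in range(NUM_GPUS)}
    for tok in token_ids:
        expert = random.randrange(NUM_EXPERTS)  # stand-in for the router
        per_gpu[expert % NUM_GPUS].append(tok)  # expert -> hosting GPU
    return per_gpu

shards = route(list(range(32)))
assert sum(len(v) for v in shards.values()) == 32  # no token lost in the scatter
```

The scatter is exactly why communication volume grows with expert parallelism: every token may need to travel to a different GPU and back on each MoE layer, which is what the NVL72 shared-memory fabric is built to absorb.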
NVIDIA's GB300 NVL72 systems powered by Blackwell Ultra GPUs achieve up to 50x higher throughput per megawatt and 35x lower cost per token compared with the Hopper platform. Cloud providers including Microsoft, CoreWeave, and Oracle are deploying these systems at scale for agentic AI and coding assistants, while leading inference providers report 4x to 10x cost reductions using open-source models.

NVIDIA has released new performance data showing that its GB300 NVL72 systems, equipped with Blackwell Ultra GPUs, achieve up to 50x higher throughput per megawatt and 35x lower cost per token than the NVIDIA Hopper platform for low-latency workloads [1]. These efficiency gains target agentic AI applications and AI coding assistants, which drove explosive growth in software-programming-related AI queries, from 11% to approximately 50% last year, according to OpenRouter's State of Inference report [1].
Source: Wccftech
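The throughput-per-megawatt and cost-per-token claims are linked by simple unit arithmetic. The sketch below uses invented numbers (the baseline throughput and the electricity price are assumptions, not NVIDIA-published figures) and models only the energy component of cost, which is why it recovers the 50x efficiency ratio rather than the all-in 35x cost figure:

```python
# Hypothetical illustration: how tokens-per-second-per-megawatt converts
# into an energy cost per million tokens. The baseline throughput and the
# electricity price are assumed values, not NVIDIA-published figures.

def cost_per_million_tokens(tokens_per_sec_per_mw: float,
                            usd_per_mwh: float) -> float:
    """Energy cost (USD) to generate one million tokens."""
    tokens_per_mwh = tokens_per_sec_per_mw * 3600  # tokens per MW-hour
    return usd_per_mwh / tokens_per_mwh * 1_000_000

hopper = cost_per_million_tokens(10_000, usd_per_mwh=100.0)      # assumed baseline
gb300 = cost_per_million_tokens(10_000 * 50, usd_per_mwh=100.0)  # the 50x claim

print(f"energy cost ratio: {hopper / gb300:.0f}x")  # prints "energy cost ratio: 50x"
```

Since energy is only one component of total cost per token (hardware amortization, networking, and facilities scale differently), a 50x efficiency gain translating into a smaller 35x all-in cost reduction is consistent.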
The performance improvements stem from extreme hardware-software codesign that addresses transformer attention-layer bottlenecks. Blackwell Ultra Tensor Cores provide 1.5x greater compute performance than standard NVIDIA Blackwell GPUs, while the architecture doubles attention-layer processing through accelerated softmax execution [4]. Cloud providers including Microsoft, CoreWeave, and Oracle Cloud Infrastructure are deploying GB300 NVL72 systems in production for low-latency and long-context workloads such as agentic coding [1].
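The softmax that Blackwell Ultra's accelerated execution targets is the normalization step inside every attention layer. The minimal, numerically stable reference version below is purely illustrative and bears no relation to the fused GPU kernel:

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)                            # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-9  # a softmax row always sums to 1
```

This exp-and-normalize pass runs once per query row in every attention layer and grows with context length, which is why the article singles out its acceleration for long-context workloads.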
Continuous software optimizations from NVIDIA's TensorRT-LLM, NVIDIA Dynamo, Mooncake, and SGLang teams have significantly boosted Blackwell NVL72 throughput for Mixture-of-Experts (MoE) inference across all latency targets [1]. TensorRT-LLM library improvements alone have delivered up to 5x better performance on GB200 for low-latency workloads compared with just four months ago [1].

Key software optimizations include higher-performance GPU kernels tuned for efficiency and low latency, NVLink Symmetric Memory enabling direct GPU-to-GPU memory access, and programmatic dependent launch, which minimizes idle time by launching the next kernel's setup phase before the previous one completes [1]. SemiAnalysis benchmarks documented that throughput per GPU has doubled at certain interactivity levels since October 2025, with NVIDIA stating these developments deliver a 10x increase in tokens per second per user and a 5x improvement in tokens per second per megawatt relative to Hopper [4].
Leading inference providers including Baseten, DeepInfra, Fireworks AI, and Together AI are reducing AI inference costs by up to 10x using open-source models on the NVIDIA Blackwell platform [2]. Production deployment data shows significant cost improvements across healthcare, gaming, agentic chat, and customer service as enterprises scale AI from pilot projects to millions of users [3].
Source: VentureBeat
The 4x to 10x cost reductions required combining Blackwell hardware with optimized software stacks and switching from proprietary to open-source models that now match frontier-level intelligence [3]. Hardware improvements alone delivered 2x gains in some deployments, but reaching larger cost reductions required adopting low-precision formats such as NVFP4 and moving away from closed-source APIs that charge premium rates [3].

Sully.ai cut healthcare AI inference costs by 90%, a 10x reduction, while improving response times by 65% for critical workflows such as generating medical notes, by switching from proprietary models to open-source models running on Baseten's Blackwell-powered platform [2]. The company has returned over 30 million minutes to physicians, time previously lost to data entry and manual tasks [2].

Latitude reduced gaming inference costs 4x for its AI Dungeon platform by running large MoE models on DeepInfra's Blackwell deployment [2]. Cost per million tokens dropped from 20 cents on the NVIDIA Hopper platform to 10 cents on Blackwell, then to 5 cents after adopting Blackwell's native NVFP4 low-precision format [2]. Sentient Foundation achieved 25% to 50% better cost efficiency using Fireworks AI's Blackwell-optimized inference stack, processing 5.6 million queries in a single week during its viral launch [3].
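Latitude's price points decompose into a hardware step and a precision-format step; this trivial sketch just replays the stated numbers:

```python
# Latitude's reported cost-per-million-token progression for AI Dungeon.
hopper_usd = 0.20           # NVIDIA Hopper platform
blackwell_usd = 0.10        # Blackwell hardware alone
blackwell_nvfp4_usd = 0.05  # Blackwell with native NVFP4

print(hopper_usd / blackwell_usd)        # prints 2.0 (hardware step)
print(hopper_usd / blackwell_nvfp4_usd)  # prints 4.0 (hardware + NVFP4)
```

The 2x hardware-only step matches the earlier observation that hardware alone delivered roughly 2x gains in some deployments, with the NVFP4 low-precision format supplying the remaining factor.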
For long-context workloads with 128,000-token inputs and 8,000-token outputs, such as AI coding assistants reasoning across entire codebases, GB300 NVL72 delivers up to 1.5x lower cost per token than GB200 NVL72 [1]. Blackwell Ultra's 1.5x higher NVFP4 compute performance and 2x faster attention processing enable agents to efficiently understand entire codebases [1].

Chen Goldberg, senior vice president of engineering at CoreWeave, stated: "As inference moves to the center of AI production, long-context performance and token efficiency become critical. Grace Blackwell NVL72 addresses that challenge directly" [1]. CoreWeave was the first AI cloud provider to deploy GB300 NVL72 systems in production [4]. Microsoft subsequently deployed what it describes as the world's first large-scale GB300 NVL72 supercomputing cluster, with testing validated by Signal65 recording over 1.1 million tokens per second on a single rack [4].
Source: NVIDIA
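NVFP4 packs each value into 4 bits and shares a scale factor across a small block of values. The sketch below is a generic block-scaled quantizer using small signed integers rather than the actual FP4 (E2M1) encoding, so it only illustrates the general idea that a per-block scale keeps reconstruction error bounded:

```python
# Generic block-scaled quantization sketch (illustrative only; real NVFP4
# stores FP4 elements with a shared per-block scale, not integers).

def quantize_block(values, levels=7):
    """Map a block of floats to signed ints in [-levels, levels] plus one scale."""
    scale = max(abs(v) for v in values) / levels or 1.0  # 1.0 for an all-zero block
    return [round(v / scale) for v in values], scale

def dequantize_block(quants, scale):
    return [q * scale for q in quants]

block = [0.12, -0.5, 0.33, 0.9]
quants, scale = quantize_block(block)
approx = dequantize_block(quants, scale)
# each reconstructed value is within half a quantization step of the original
assert all(abs(a - b) <= scale / 2 + 1e-12 for a, b in zip(block, approx))
```

Shrinking the block over which the scale is shared is the key trade-off: smaller blocks track local magnitudes more tightly at the cost of storing more scale factors.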
Blackwell Ultra expands to a 72-GPU configuration, joining all GPUs into a single unified NVLink fabric with 130 TB/s of connectivity [5]. Compared with Hopper, which is confined to an 8-GPU NVLink domain, NVIDIA has delivered superior architecture, rack design, and the NVFP4 precision format, which explains why GB300 dominates in throughput [5].
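The fabric figure implies a per-GPU share that is easy to check, assuming the quoted 130 TB/s is aggregate bandwidth divided evenly across the rack:

```python
# Back-of-envelope per-GPU share of the NVL72 fabric, assuming the quoted
# 130 TB/s is aggregate NVLink bandwidth split evenly across all 72 GPUs.
aggregate_tb_per_s = 130
gpus = 72
per_gpu = aggregate_tb_per_s / gpus
print(f"{per_gpu:.2f} TB/s per GPU")  # prints "1.81 TB/s per GPU"
```

That per-GPU share lines up with the roughly 1.8 TB/s of NVLink bandwidth publicly quoted for Blackwell-generation GPUs, which supports reading the 130 TB/s number as an aggregate figure.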
"Performance is what drives down the cost of inference," said Dion Harris, senior director of HPC and AI hyperscaler solutions at NVIDIA. "What we're seeing in inference is that throughput literally translates into real dollar value and driving down the cost" [3]. Oracle's OCI platform is deploying GB300 NVL72 systems with plans to scale Superclusters beyond 100,000 Blackwell GPUs to support inference workload demand [4]. NVIDIA has previewed its next-generation Rubin platform, projecting a 10x performance improvement over Blackwell [4].