NVIDIA Blackwell Ultra slashes AI inference costs by 35x while delivering 50x better performance

NVIDIA's GB300 NVL72 systems powered by Blackwell Ultra GPUs achieve up to 50x higher throughput per megawatt and 35x lower cost per token compared to the Hopper platform. Cloud providers including Microsoft, CoreWeave, and Oracle are deploying these systems at scale for agentic AI and coding assistants, while leading inference providers report 4x to 10x cost reductions using open-source models.

NVIDIA Blackwell Ultra Delivers Breakthrough Performance for Agentic AI

NVIDIA has released new performance data showing that its GB300 NVL72 systems, equipped with Blackwell Ultra GPUs, achieve up to 50x higher throughput per megawatt and 35x lower cost per token than the NVIDIA Hopper platform for low-latency workloads [1]. These efficiency gains target agentic AI applications and AI coding assistants, which drove software-programming-related AI queries from 11% to approximately 50% of traffic over the past year, according to OpenRouter's State of Inference report [1].

Source: Wccftech

The performance improvements stem from extreme hardware-software co-design that addresses transformer attention-layer bottlenecks. Blackwell Ultra Tensor Cores provide 1.5x greater compute performance than standard NVIDIA Blackwell GPUs, while the architecture doubles attention-layer processing through accelerated softmax execution [4]. Cloud providers including Microsoft, CoreWeave, and Oracle Cloud Infrastructure are deploying GB300 NVL72 systems in production for low-latency and long-context workloads such as agentic coding [1].

Software Optimizations Drive Continuous Performance Gains

Continuous software optimizations from NVIDIA's TensorRT-LLM, NVIDIA Dynamo, Mooncake, and SGLang teams have significantly boosted Blackwell NVL72 throughput for Mixture-of-Experts (MoE) inference across all latency targets [1]. The TensorRT-LLM library improvements alone have delivered up to 5x better performance on GB200 for low-latency workloads compared with just four months ago [1].

Key software optimizations include higher-performance GPU kernels tuned for efficiency and low latency, NVLink Symmetric Memory enabling direct GPU-to-GPU memory access, and programmatic dependent launch that minimizes idle time between kernels [1]. SemiAnalysis benchmarks documented that throughput per GPU has doubled at certain interactivity levels since October 2025, and NVIDIA states these developments deliver a 10x increase in tokens per second per user and a 5x improvement in tokens per second per megawatt relative to Hopper [4].

AI Inference Costs Drop Up to 10x with Open-Source Models

Leading inference providers including Baseten, DeepInfra, Fireworks AI, and Together AI are reducing AI inference costs by up to 10x using open-source models on the NVIDIA Blackwell platform [2]. Production deployment data shows significant cost improvements across healthcare, gaming, agentic chat, and customer service as enterprises scale AI from pilot projects to millions of users [3].

Source: VentureBeat

The 4x to 10x cost reductions required combining Blackwell hardware with optimized software stacks and switching from proprietary to open-source models that now match frontier-level intelligence [3]. Hardware improvements alone delivered 2x gains in some deployments, but reaching the larger cost reductions required adopting low-precision formats like NVFP4 and moving away from closed-source APIs that charge premium rates [3].
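As a rough illustration of why a 4-bit format changes the economics, the sketch below simulates block-scaled FP4 (E2M1) quantization in plain Python: values are snapped to the eight representable E2M1 magnitudes, with one scale factor per block. The block size and max-based scaling here are simplifying assumptions for illustration, not a specification of NVFP4's exact encoding.

```python
# Illustrative sketch of block-scaled 4-bit floating-point (FP4 E2M1)
# quantization. The per-block max-based scale is a simplifying assumption.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable E2M1 magnitudes

def quantize_block(values, block=16):
    """Quantize floats to FP4 E2M1 magnitudes with one scale per block."""
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        amax = max(abs(v) for v in chunk) or 1.0
        scale = amax / 6.0  # map the block's largest value onto the top FP4 code
        for v in chunk:
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append((mag if v >= 0 else -mag) * scale)
    return out

# Each stored value costs 4 bits plus an amortized share of one scale per block,
# versus 16 bits for FP16 -- roughly 4x less memory traffic per weight.
print(quantize_block([0.9, -0.3, 0.1, 1.2], block=4))
```

The memory and bandwidth savings, not the rounding itself, are what drive the cost-per-token improvements described above.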

Real-World Deployments Show Dramatic Cost Reductions

By switching from proprietary models to open-source models running on Baseten's Blackwell-powered platform, Sully.ai cut healthcare AI inference costs by 90% (a 10x reduction) and improved response times by 65% for critical workflows such as generating medical notes [2]. The company has returned over 30 million minutes to physicians, time previously lost to data entry and manual tasks [2].

Latitude reduced gaming inference costs 4x for its AI Dungeon platform by running large MoE models on DeepInfra's Blackwell deployment [2]. Cost per million tokens dropped from 20 cents on the NVIDIA Hopper platform to 10 cents on Blackwell, then to 5 cents after adopting Blackwell's native NVFP4 low-precision format [2]. Sentient Foundation achieved 25% to 50% better cost efficiency using Fireworks AI's Blackwell-optimized inference stack, processing 5.6 million queries in a single week during its viral launch [3].
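The Latitude numbers are easy to sanity-check. The snippet below just does the arithmetic on the reported per-million-token prices; the monthly token volume is a hypothetical figure for illustration, not Latitude's actual usage.

```python
# Reported $/1M-token prices for Latitude's AI Dungeon workload:
# $0.20 on Hopper, $0.10 on Blackwell, $0.05 with Blackwell + NVFP4.
hopper = 0.20
blackwell = 0.10
blackwell_nvfp4 = 0.05

print(f"Overall reduction: {hopper / blackwell_nvfp4:.0f}x")  # matches the reported 4x

# Impact at a hypothetical 50 billion tokens/month (50,000 million tokens);
# the volume is illustrative only.
millions_of_tokens = 50_000
for name, price in [("Hopper", hopper),
                    ("Blackwell", blackwell),
                    ("Blackwell + NVFP4", blackwell_nvfp4)]:
    print(f"{name}: ${millions_of_tokens * price:,.0f}/month")
```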

GB300 Delivers Superior Economics for Long-Context Workloads

For long-context workloads with 128,000-token inputs and 8,000-token outputs, such as AI coding assistants reasoning across entire codebases, GB300 NVL72 delivers up to 1.5x lower cost per token compared with GB200 NVL72 [1]. Blackwell Ultra's 1.5x higher NVFP4 compute performance and 2x faster attention processing enable agents to efficiently understand entire codebases [1].

Chen Goldberg, senior vice president of engineering at CoreWeave, stated: "As inference moves to the center of AI production, long-context performance and token efficiency become critical. Grace Blackwell NVL72 addresses that challenge directly" [1]. CoreWeave was the first AI cloud provider to deploy GB300 NVL72 systems in production [4]. Microsoft subsequently deployed what it describes as the world's first large-scale GB300 NVL72 supercomputing cluster; testing validated by Signal65 recorded the cluster achieving over 1.1 million tokens per second on a single rack [4].
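To make the throughput-per-megawatt metric concrete, the sketch below converts a rack-level token rate into tokens per second per megawatt. The ~140 kW rack power is an assumed round figure for illustration, not a published NVIDIA or Microsoft specification.

```python
# Convert a rack-level token rate into the tokens/s-per-megawatt efficiency
# metric cited throughout this article.
def tokens_per_sec_per_mw(tokens_per_sec: float, rack_power_kw: float) -> float:
    """Normalize throughput by power draw, expressed in megawatts."""
    return tokens_per_sec / (rack_power_kw / 1000.0)

# Signal65's measured ~1.1M tokens/s on one GB300 NVL72 rack; the 140 kW
# rack power used here is an illustrative assumption.
rate = tokens_per_sec_per_mw(1_100_000, 140.0)
print(f"{rate:,.0f} tokens/s per MW")
```

Normalizing by power rather than by GPU count is what makes cross-generation comparisons like the 50x-per-megawatt claim meaningful, since newer racks draw more power per GPU.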

Source: NVIDIA

Architecture Innovations Enable Massive Efficiency Gains

Blackwell Ultra expands to a 72-GPU configuration, joining all 72 GPUs into a single unified NVLink fabric with 130 TB/s of connectivity [5]. Hopper, by contrast, is limited to an 8-GPU NVLink domain; the combination of this larger NVLink domain, rack-scale design, and the NVFP4 precision format explains GB300's dominance in throughput [5].

"Performance is what drives down the cost of inference," said Dion Harris, senior director of HPC and AI hyperscaler solutions at NVIDIA. "What we're seeing in inference is that throughput literally translates into real dollar value and driving down the cost" [3]. Oracle's OCI platform is deploying GB300 NVL72 systems with plans to scale Superclusters beyond 100,000 Blackwell GPUs to support inference workload demand [4]. NVIDIA has previewed its next-generation Rubin platform, projecting a 10x performance improvement over Blackwell [4].
