Nvidia unveils Groq 3 LPU to accelerate AI inference with SRAM-powered architecture

Reviewed by Nidhi Govil

Nvidia introduced the Groq 3 LPU at GTC 2026, a specialized chip designed to accelerate AI inference using SRAM memory instead of traditional HBM. The chip stems from Nvidia's $20 billion acquisition of Groq's intellectual property and promises to deliver faster token generation for chatbots and AI agents through 256-chip LPX rack systems paired with Rubin GPUs.

Nvidia Targets AI Inference with $20 Billion Groq Acquisition

Nvidia CEO Jensen Huang took the stage at GTC 2026 in San Jose to announce the Groq 3 LPU, a language processing unit that marks a strategic shift for the GPU giant toward specialized AI inference hardware [1]. The chip incorporates intellectual property Nvidia licensed from startup Groq last Christmas Eve for $20 billion, addressing what Huang called "the inflection point of inference" [1]. The Groq 3 LPU joins Nvidia's Vera Rubin platform alongside the Rubin GPU, Vera CPU, and networking components to form a comprehensive data center architecture [5].

The acquisition reflects mounting pressure on Nvidia to maintain dominance as competitors like AMD close gaps and Amazon Web Services deploys alternative inference solutions combining Trainium accelerators with Cerebras wafer-scale systems [4]. Ian Buck, Nvidia's VP of Hyperscale and HPC, explained the urgency: "We've pulled CPX" to focus resources on optimizing decode performance with the LPU this year [2].

SRAM Memory Architecture Delivers Extreme Bandwidth

The Groq 3 LPU distinguishes itself through an SRAM-based architecture that prioritizes memory bandwidth over capacity. Each chip contains 500 MB of SRAM delivering 150 TB/s of bandwidth, roughly seven times the 22 TB/s offered by the Rubin GPU's 288 GB of HBM4 memory [1]. The chip achieves 1.2 petaFLOPS of FP8 compute, though support for 4-bit block floating point data types won't arrive until the LP35 generation next year [4].
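
A rough way to see why that bandwidth gap matters: in the decode phase, every generated token requires streaming the active model weights through the compute units, so the ceiling on tokens per second is roughly memory bandwidth divided by bytes read per token. The short sketch below runs that back-of-the-envelope calculation; the 80 GB of active FP8 weights is an illustrative assumption, not a figure from the announcement.

```python
def peak_decode_tps(bandwidth_tb_per_s: float, weight_bytes_gb: float) -> float:
    """Upper bound on decode tokens/sec when generation is memory-bandwidth bound.

    Assumes each token requires one full pass over the active weights and
    ignores compute time, interconnect hops, and batching effects.
    """
    return (bandwidth_tb_per_s * 1e12) / (weight_bytes_gb * 1e9)


ACTIVE_WEIGHTS_GB = 80  # assumed size of the weights read per token (FP8)

print(f"HBM4 @ 22 TB/s : ~{peak_decode_tps(22, ACTIVE_WEIGHTS_GB):,.0f} tokens/s ceiling")
print(f"SRAM @ 150 TB/s: ~{peak_decode_tps(150, ACTIVE_WEIGHTS_GB):,.0f} tokens/s ceiling")
# The ratio between the two ceilings simply tracks the ~7x bandwidth gap.
```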

Groq's approach interleaves processing units with memory units directly on the chip, eliminating the need for data to travel off-chip to HBM and back. "The data actually flows directly through the SRAM," said Mark Heaps, formerly Groq's chief technology evangelist and now Nvidia's director of developer marketing [1]. This linear data flow enables the extremely low-latency token generation required for interactive AI chatbot performance and for reasoning models that run inference many times before users see output [1].

Nvidia will deploy these chips in LPX rack systems containing 256 LPUs spread across 32 compute trays, each with eight LPUs plus fabric expansion logic, DRAM, a host CPU, and a BlueField-4 data processing unit [4]. A complete rack offers 128 GB of on-chip SRAM with 640 TB/s of scale-up bandwidth [3].
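
The rack-level figures follow directly from the per-chip specification, and they also bound how much model can live entirely in rack SRAM. A minimal sanity-check sketch, using only the numbers quoted above:

```python
LPUS_PER_TRAY = 8
TRAYS_PER_RACK = 32
SRAM_PER_LPU_GB = 0.5  # 500 MB of SRAM per chip

lpus_per_rack = LPUS_PER_TRAY * TRAYS_PER_RACK      # 256 LPUs
rack_sram_gb = lpus_per_rack * SRAM_PER_LPU_GB      # 128 GB of on-chip SRAM
print(f"{lpus_per_rack} LPUs -> {rack_sram_gb:.0f} GB of rack-wide SRAM")

# At FP8 (one byte per parameter), that SRAM alone holds on the order of
# 128 billion parameters; anything larger has to live somewhere else.
print(f"~{rack_sram_gb:.0f}B FP8 parameters fit in a single rack's SRAM")
```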

Hybrid Decode Architecture Splits Workloads Between LPU and GPU

Nvidia's strategy pairs LPX rack systems with Vera Rubin NVL72 server units to enable what Buck calls disaggregated inference. The decode phase splits between LPU and GPU on a layer-by-layer basis, with computations benefiting from fast SRAM running on the LPU while attention math, softmax, routing, and KV cache calculations execute on GPUs [2]. "We can focus and run the computations that benefit from the fast SRAM of the LPU over here in one layer, and literally the next layer, we can send the intermediate activation state over to the GPUs," Buck explained [2].
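
A minimal sketch of what such a layer-by-layer split could look like in a decode loop. The device placement and function names below are assumptions drawn from Buck's description, not Nvidia's actual scheduler:

```python
from dataclasses import dataclass


@dataclass
class Layer:
    index: int


def run_weight_heavy_ops(layer, activations):
    """Weight-streaming work (e.g. the dense/MLP projections) assumed to sit on
    the LPU, where reading weights out of SRAM is the dominant cost."""
    return activations  # placeholder for the actual matmuls


def run_attention_ops(layer, activations, kv_cache):
    """Attention, softmax, routing and KV-cache updates assumed to sit on the GPU,
    whose large HBM holds the per-query state."""
    kv_cache.setdefault(layer.index, []).append(activations)
    return activations  # placeholder for the actual attention math


def decode_one_token(layers, activations, kv_cache):
    # Hypothetical hybrid decode step: each layer's work is split across devices,
    # with the intermediate activation state shipped between LPU and GPU per layer.
    for layer in layers:
        activations = run_weight_heavy_ops(layer, activations)          # LPU side
        activations = run_attention_ops(layer, activations, kv_cache)   # GPU side
    return activations


# Toy usage: four layers, a dummy activation, an empty KV cache.
print(decode_one_token([Layer(i) for i in range(4)], "token_state", {}))
```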

This hybrid approach allows only the LPUs to store model weights, while per-query state and KV cache data, which can grow quite large, remain in GPU HBM [2]. The combined system promises up to 35x throughput increases when running large language models reaching 1 trillion parameters, according to Nvidia benchmarks [3]. Buck positioned this capability for multi-agent systems requiring interactive performance while inferencing trillion-parameter models with context windows of millions of tokens [5].
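
To see why the per-query state is the part that "can grow quite large," compare the KV cache for a long-context request with the weights themselves. The model dimensions below are illustrative assumptions for a transformer in the trillion-parameter class, not disclosed specifications:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 1) -> float:
    """Approximate KV-cache size for one query: K and V tensors for every layer."""
    values = 2 * layers * kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


# Assumed dimensions for a ~1T-parameter model, with 1-byte (FP8) cache entries.
LAYERS, KV_HEADS, HEAD_DIM = 120, 16, 128
WEIGHTS_GB = 1_000  # ~1 TB of weights at one byte per parameter

for context in (128_000, 1_000_000, 4_000_000):
    cache = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, context)
    print(f"{context:>9,} tokens of context -> ~{cache:,.0f} GB of KV cache per query")

print(f"Model weights (read-only, shared across queries): ~{WEIGHTS_GB:,} GB")
```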

Time-to-Market Drove Acquisition Over Internal Development

The Groq 3 chip is based on Groq's second-generation LPU technology with last-minute tweaks made before manufacturing at Samsung fabs [4]. Notably, it lacks NVLink interconnect, NVFP4 hardware support, and CUDA compatibility at launch, a sign that Nvidia prioritized speed to market over deep integration [4]. The $20 billion amounted to the cost of shipping products this year rather than building from scratch [4].

Huang indicated the chip will ship in Q3, with one analyst projecting 4 to 5 million LPU shipments through 2026 and 2027 [3]. The systems target major AI companies including OpenAI, Anthropic, and Meta, potentially powering chatbot queries and image generation requests [3]. Huang suggested high-performance, low-latency inference providers could eventually charge as much as $150 per million tokens for this capability [4].

The move addresses a fundamental shift in AI economics as computational load transitions from building larger models to deploying them at scale. D-Matrix CEO Sid Sheth noted that "winning systems will combine different types of silicon and fit easily into existing data centers alongside GPUs" [1]. For AI agents communicating with other AIs rather than humans, Buck envisions moving from 100 tokens per second to 1,500 TPS or more [5].
