Google's TurboQuant slashes AI memory usage by 6x, sends memory chip stocks tumbling

Reviewed by Nidhi Govil

Google Research unveiled TurboQuant, a compression algorithm that reduces large language model memory footprint by at least 6x while delivering an 8x performance boost on Nvidia H100 GPUs. The breakthrough targets the key-value cache bottleneck without sacrificing accuracy, but memory chip stocks including Micron and Western Digital dropped as investors reconsidered future demand for AI hardware.

Google Research Tackles the AI Memory Crisis

Google Research published TurboQuant this week, a compression algorithm designed to address one of the most expensive challenges in running large language models: the escalating memory demands of the key-value cache [1]. This cache holds context information so models don't have to recompute data with every token generated, but as context windows expand, it consumes massive amounts of GPU memory that could otherwise serve more users or run larger models [3]. TurboQuant compresses the cache to just 3 bits per value, down from the standard 16, achieving at least a 6x reduction in memory usage [2]. In benchmarks on Nvidia H100 GPUs, 4-bit TurboQuant delivered up to an 8x performance boost in computing attention logits compared to unquantized 32-bit keys [2].
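To put the bit widths in context, here is a back-of-envelope sizing of the KV cache at 16 versus 3 bits per value. The model dimensions below (layers, KV heads, head size, context length) are illustrative assumptions, not figures from Google's paper; note that the nominal 16-to-3-bit ratio alone is about 5.3x, so the reported "at least 6x" presumably also counts overhead that prior methods add.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bits_per_value):
    """Size in bytes of the KV cache for one sequence: two tensors (K and V)
    of shape [num_layers, num_kv_heads, context_len, head_dim]."""
    values = 2 * num_layers * num_kv_heads * context_len * head_dim
    return values * bits_per_value // 8

# Llama-8B-style configuration at a 128k-token context (assumed for illustration).
fp16 = kv_cache_bytes(32, 8, 128, context_len=128_000, bits_per_value=16)
q3 = kv_cache_bytes(32, 8, 128, context_len=128_000, bits_per_value=3)

print(f"16-bit cache: {fp16 / 2**30:.1f} GiB")  # ~15.6 GiB
print(f" 3-bit cache: {q3 / 2**30:.1f} GiB")    # ~2.9 GiB
```

At these assumed dimensions, the 16-bit cache for a single full-length sequence exceeds 15 GiB, which is why the cache, not the weights, is often the binding constraint on an 80 GiB H100.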

Source: Digit


How TurboQuant Eliminates Compression Overhead

The innovation behind TurboQuant lies in eliminating the overhead that makes most compression techniques less effective than their headline numbers suggest [3]. Traditional vector quantization methods reduce the size of data vectors but must store additional normalization constants alongside the compressed data, typically adding one or two extra bits per number and partially undoing the compression gains [4]. TurboQuant avoids this through a two-stage process developed by research scientist Amir Zandieh and VP Vahab Mirrokni, along with collaborators at Google DeepMind, KAIST, and New York University [3]. The first stage, called PolarQuant, converts data vectors from standard Cartesian coordinates into polar coordinates, separating each vector into a radius representing magnitude and a set of angles representing direction [4]. Because the angular distributions follow predictable, concentrated patterns after a random rotation, the system skips expensive per-block normalization steps entirely [2].
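A minimal sketch of the polar idea, assuming pairwise 2D blocks and a 3-bit uniform angle grid (the paper's actual scheme, including the random rotation and how radii are coded, is more elaborate): because every angle lies in the known range [-π, π), a single fixed grid serves all blocks, so no per-block scale or zero-point constants need to be stored.

```python
import numpy as np

rng = np.random.default_rng(0)

def polar_quantize(x, angle_bits=3):
    """Split x into 2D blocks and store each as (radius, quantized angle).

    The angle range [-pi, pi) is known in advance, so one global uniform
    grid quantizes every block -- avoiding the stored normalization
    constants the article describes. (Sketch only: this toy version skips
    the random rotation and keeps radii in full precision.)
    """
    pairs = x.reshape(-1, 2)
    radii = np.hypot(pairs[:, 0], pairs[:, 1])   # block magnitudes
    theta = np.arctan2(pairs[:, 1], pairs[:, 0]) # block angles in [-pi, pi]
    levels = 2 ** angle_bits
    step = 2 * np.pi / levels
    codes = np.floor((theta + np.pi) / step).astype(int) % levels
    return radii, codes, step

def polar_dequantize(radii, codes, step):
    theta = codes * step - np.pi + step / 2      # reconstruct at bin centers
    return np.stack([radii * np.cos(theta), radii * np.sin(theta)], axis=1).ravel()

x = rng.standard_normal(128)
radii, codes, step = polar_quantize(x)
x_hat = polar_dequantize(radii, codes, step)
err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative error with 3-bit angles: {err:.2f}")
```

Even this naive version loses only modest precision per block; the concentration of angles after a random rotation is what lets the real algorithm push further with fewer bits.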

Source: Ars Technica


Quantized Johnson-Lindenstrauss Delivers Error Correction

The second stage applies a 1-bit error correction layer using an algorithm called Quantized Johnson-Lindenstrauss, or QJL [1]. QJL projects the residual quantization error from PolarQuant into a lower-dimensional space and reduces each value to a single sign bit, either +1 or -1, while preserving the essential vector data that describes relationships [1]. This serves as a zero-bias estimator that eliminates systematic bias in attention score calculations at negligible additional cost [2]. The result is a more accurate attention score, the fundamental process by which neural networks decide what data is important when processing queries [1].
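The sign-bit estimator can be sketched as follows, assuming a shared Gaussian projection and the standard rescaling for sign estimators; this illustrates the zero-bias property, not Google's implementation, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def qjl_encode(k, S):
    """Compress key k to one sign bit per projected coordinate, plus its norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner_product(q, sign_bits, k_norm, S):
    """Unbiased estimate of <q, k> from the sign bits.

    For a Gaussian projection S with m rows,
        E[(S q) . sign(S k)] = m * sqrt(2/pi) * <q, k> / ||k||,
    so rescaling by ||k|| * sqrt(pi/2) / m removes the systematic bias."""
    m = S.shape[0]
    return (k_norm * np.sqrt(np.pi / 2) / m) * ((S @ q) @ sign_bits)

d, m = 64, 20_000                  # m is large here only to shrink variance
S = rng.standard_normal((m, d))    # shared random Gaussian projection
q, k = rng.standard_normal(d), rng.standard_normal(d)

bits, k_norm = qjl_encode(k, S)
est = qjl_inner_product(q, bits, k_norm, S)
print(f"true <q,k> = {q @ k:.2f}, QJL-style estimate = {est:.2f}")
```

The key property is that the estimate is correct in expectation: averaged over the random projection, the sign bits introduce no systematic tilt to the attention scores, only zero-mean noise that shrinks with more projected dimensions.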

Benchmarks Show No Accuracy Loss Across Models

Google tested TurboQuant across five standard benchmarks for long-context language models, including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, using open-source models from the Gemma, Mistral, and Llama families [2]. TurboQuant achieved perfect downstream scores on needle-in-a-haystack retrieval tasks while compressing KV memory by at least six times [2]. On the LongBench suite, which covers question answering, code generation, and summarization, TurboQuant matched or outperformed the KIVI baseline across all tasks [2]. The algorithm requires no training or fine-tuning and incurs negligible runtime overhead, making it suitable for deployment in production inference and large-scale vector search systems [2]. Importantly, TurboQuant targets the inference memory bottleneck during KV cache compression, not the model's weights, which is a completely different compression challenge [5].

Memory Chip Stocks React to Efficiency Breakthrough

Within hours of Google's announcement, memory chip stocks fell as investors recalculated how much physical memory the AI industry might actually need [3]. Micron dropped 3 percent, Western Digital lost 4.7 percent, and SanDisk fell 5.7 percent [3]. Wells Fargo analyst Andrew Rocha noted that TurboQuant directly attacks the cost curve for AI memory systems, quickly raising questions about actual capacity requirements [3]. However, analysts cautioned that the demand picture for AI memory remains strong, and compression algorithms have existed for years without fundamentally altering procurement volumes [3]. The paper will be presented at ICLR 2026 next month, with related work on PolarQuant appearing at AISTATS 2026 [2].

Production Deployment and Cost Implications

TurboQuant arrives as the AI industry confronts the economics of inference, where serving millions of queries per day with acceptable latency determines whether AI products are financially viable at scale [3]. The KV cache is the bottleneck that limits how many concurrent users a single GPU can serve and how long a context window a model can practically support [3]. Enterprises that deploy TurboQuant on their models could see serving costs fall by more than 50 percent [4]. Because the algorithm can quantize the cache to just 3 bits with no additional training, it can be applied to existing models without architectural changes [1]. Mobile AI could see particular benefit, as hardware limitations on smartphones make compression techniques like TurboQuant valuable for improving output quality without sending data to the cloud [1]. Whether the results carry over from clean benchmarks to production systems serving billions of requests remains to be seen, as TurboQuant was tested on open-source models rather than Google's own Gemini stack at scale [5].
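The concurrency claim can be illustrated with rough arithmetic: once the weights are resident, the leftover GPU memory divided by the per-sequence cache size caps concurrent users, so a 6x smaller cache fits six times as many. All numbers below are assumptions for illustration, not figures from the article.

```python
# Illustrative capacity arithmetic; none of these numbers come from the article.
GPU_MEM_GIB = 80          # H100 HBM capacity
WEIGHTS_GIB = 16          # assumed footprint of the model weights
CACHE_PER_USER_GIB = 4    # assumed fp16 KV cache per concurrent sequence

def concurrent_users(kv_compression):
    """How many sequences fit in the memory left over after the weights."""
    free = GPU_MEM_GIB - WEIGHTS_GIB
    return free * kv_compression // CACHE_PER_USER_GIB

print(concurrent_users(1))  # fp16 baseline -> 16
print(concurrent_users(6))  # with ~6x KV compression -> 96
```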

TheOutpost.ai

© 2026 Triveous Technologies Private Limited