7 Sources
[1]
Google's TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x
Even if you don't know much about the inner workings of generative AI models, you probably know they need a lot of memory. Hence, it is currently almost impossible to buy a measly stick of RAM without getting fleeced. Google Research recently revealed TurboQuant, a compression algorithm that reduces the memory footprint of large language models (LLMs) while also boosting speed and maintaining accuracy.

TurboQuant is aimed at reducing the size of the key-value cache, which Google likens to a "digital cheat sheet" that stores important information so it doesn't have to be recomputed. This cheat sheet is necessary because, as we say all the time, LLMs don't actually know anything; they can do a good impression of knowing things through the use of vectors, which map the semantic meaning of tokenized text. When two vectors are similar, the concepts they represent are similar. High-dimensional vectors, which can have hundreds or thousands of dimensions, can describe complex information like the pixels in an image or a large data set. They also occupy a lot of memory and inflate the size of the key-value cache, bottlenecking performance.

To make models smaller and more efficient, developers employ quantization techniques to run them at lower precision. The drawback is that the outputs get worse -- the quality of token estimation goes down. With TurboQuant, Google's early results show an 8x performance increase and a 6x reduction in memory usage in some tests without a loss of quality.

Angles and errors

Applying TurboQuant to an AI model is a two-step process. To achieve high-quality compression, Google has devised a system called PolarQuant. Usually, vectors in AI models are encoded using standard Cartesian (XYZ) coordinates, but PolarQuant converts them into polar coordinates. On this circular grid, the vectors are reduced to two pieces of information: a radius (core data strength) and a direction (the data's meaning).
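A minimal sketch of this Cartesian-to-polar conversion, purely illustrative and not Google's implementation. For the (3 East, 4 North) example Google uses, the radius is 5 and the angle is about 53 degrees measured from due East, which is the same direction as a compass bearing of about 37 degrees from North.

```python
import math

def to_polar(x, y):
    """Convert Cartesian (x, y) into polar form: a radius (magnitude)
    and an angle (direction), the same two quantities PolarQuant keeps."""
    radius = math.hypot(x, y)                # length of the vector
    angle = math.degrees(math.atan2(y, x))   # direction, measured from the x-axis
    return radius, angle

# Google's example: 3 blocks East (x) and 4 blocks North (y).
radius, angle = to_polar(3, 4)
# radius == 5.0; angle is ~53.13 degrees from due East,
# i.e. a bearing of ~36.87 degrees from North.
```

Two numbers replace two coordinates, so nothing is saved yet; the savings come later, when the angle's predictable distribution lets it be quantized aggressively.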
Google offers an interesting real-world analogy to explain this process. The vector coordinates are like directions, so the traditional encoding might be "Go 3 blocks East, 4 blocks North." In polar coordinates, it's simply "Go 5 blocks at 37 degrees." This takes up less space and saves the system from performing expensive data normalization steps.

PolarQuant does most of the compression, but the second step cleans up the rough spots. While PolarQuant is effective, it can leave residual errors. Google proposes smoothing those out with a technique called Quantized Johnson-Lindenstrauss (QJL). This applies a 1-bit error-correction layer, reducing each residual value to a single bit (+1 or -1) while preserving the essential vector data that describes relationships. The result is a more accurate attention score -- the fundamental process by which neural networks decide what data is important.

So does all this math work? Google says it tested the new algorithmic compression across a suite of long-context benchmarks using both Gemma and Mistral open models. TurboQuant apparently had perfect downstream results in all tests while reducing memory usage in the key-value cache by 6x. The algorithm can quantize the cache to just 3 bits with no additional training, so it can be applied to existing models. Computing the attention score with 4-bit TurboQuant is also 8x faster compared to 32-bit unquantized keys on Nvidia H100 accelerators.

If implemented, TurboQuant could make AI models less expensive to run and less hungry for memory. However, the companies creating this technology could also use the newly freed-up memory to run more complex models. It'll probably be a mix of both, but mobile AI could see more benefit. With the hardware limitations of a smartphone, compression techniques like TurboQuant could improve the quality of outputs without sending your data to the cloud.
[2]
Google's TurboQuant reduces AI LLM cache memory capacity requirements by at least six times -- up to 8x performance boost on Nvidia H100 GPUs, compresses KV caches to 3 bits with no accuracy loss
The algorithm achieves up to an eight-times performance boost over unquantized keys on Nvidia H100 GPUs. Google Research published TurboQuant on Tuesday, a training-free compression algorithm that quantizes LLM KV caches down to 3 bits without any loss in model accuracy. In benchmarks on Nvidia H100 GPUs, 4-bit TurboQuant delivered up to an eight-times performance increase in computing attention logits compared to unquantized 32-bit keys, while reducing KV cache memory by at least six times.

KV caches store previously computed attention data so that LLMs don't have to recompute it at each token generation step. These caches are becoming major memory bottlenecks as context windows grow larger, and while traditional vector quantization methods can reduce the size of these caches, they introduce a small memory overhead of a few extra bits per value from the quantization constants that must be stored alongside the compressed data. That overhead sounds small, but it compounds as context windows grow.

TurboQuant eliminates that overhead via a two-stage process. The first uses a technique called PolarQuant, which converts data vectors from standard Cartesian coordinates into polar coordinates. This separates each vector into a radius (representing magnitude) and a set of angles (representing direction). Because the angular distributions are predictable and concentrated, PolarQuant skips the expensive per-block normalization step that conventional quantizers require. This leads to high-quality compression with zero overhead from stored quantization constants. The second stage applies a 1-bit error correction layer using an algorithm called Quantized Johnson-Lindenstrauss (QJL). QJL projects the residual quantization error into a lower-dimensional space and reduces each value to a single sign bit, eliminating systematic bias in attention score calculations at negligible additional cost.
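The "zero overhead from stored constants" property can be illustrated with a toy fixed-grid angle quantizer. A real PolarQuant codebook is more sophisticated, but the principle is the same: because the grid is global and never changes, only the small integer codes need to be stored, with no per-block scale or offset constants.

```python
import math

BITS = 3
LEVELS = 2 ** BITS             # 8 grid cells around the circle
STEP = 2 * math.pi / LEVELS    # fixed, global grid spacing

def quantize_angle(theta):
    """Snap an angle onto the fixed circular grid. The grid is shared
    by every value, so nothing but the 3-bit code is stored."""
    return int((theta % (2 * math.pi)) / STEP)

def dequantize_angle(code):
    """Reconstruct at the centre of the grid cell."""
    return (code + 0.5) * STEP

theta = math.radians(100)
code = quantize_angle(theta)    # an integer in [0, 8)
error = abs(dequantize_angle(code) - theta)
# Reconstruction error is bounded by half a grid step
# (22.5 degrees at 3 bits), regardless of the input.
```

This is why the predictable, concentrated angular distribution matters: a fixed grid only works well when the data reliably falls where the grid expects it.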
Google tested TurboQuant and its component techniques, PolarQuant and QJL, across long-context benchmarks, including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, using the open-source models Gemma and Mistral. TurboQuant achieved perfect downstream scores on needle-in-a-haystack retrieval tasks while compressing KV memory by at least six times. On the LongBench suite, which covers question answering, code generation, and summarization, TurboQuant matched or outperformed the KIVI baseline across all tasks. The algorithm also showed strong results in vector search. Evaluated against Product Quantization and RabbiQ on the GloVe dataset, TurboQuant achieved the highest 1@k recall ratios despite those baselines relying on larger codebooks and dataset-specific tuning. Google noted that TurboQuant requires no training or fine-tuning and incurs negligible runtime overhead, making it suitable for deployment in production inference and large-scale vector search systems. The paper, co-authored by research scientist Amir Zandieh and VP Vahab Mirrokni, will be presented at ICLR 2026 next month.
[3]
Google's TurboQuant compresses AI memory by 6x, rattles chip stocks
Google published a research blog post on Tuesday about a new compression algorithm for AI models. Within hours, memory stocks were falling. Micron dropped 3 per cent, Western Digital lost 4.7 per cent, and SanDisk fell 5.7 per cent, as investors recalculated how much physical memory the AI industry might actually need. The algorithm is called TurboQuant, and it addresses one of the most expensive bottlenecks in running large language models: the key-value cache, a high-speed data store that holds context information so the model does not have to recompute it with every new token it generates. As models process longer inputs, the cache grows rapidly, consuming GPU memory that could otherwise be used to serve more users or run larger models. TurboQuant compresses the cache to just 3 bits per value, down from the standard 16, reducing its memory footprint by at least six times without, according to Google's benchmarks, any measurable loss in accuracy. The paper, which will be presented at ICLR 2026, was authored by Amir Zandieh, a research scientist at Google, and Vahab Mirrokni, a vice president and Google Fellow, along with collaborators at Google DeepMind, KAIST, and New York University. It builds on two earlier papers from the same group: QJL, published at AAAI 2025, and PolarQuant, which will appear at AISTATS 2026. TurboQuant's core innovation is eliminating the overhead that makes most compression techniques less effective than their headline numbers suggest. Traditional quantization methods reduce the size of data vectors but must store additional constants, normalization values that the system needs in order to decompress the data accurately. These constants typically add one or two extra bits per number, partially undoing the compression. TurboQuant avoids this through a two-stage process. 
The first stage, called PolarQuant, converts data vectors from standard Cartesian coordinates into polar coordinates, separating each vector into a magnitude and a set of angles. Because the angular distributions follow predictable, concentrated patterns, the system can skip the expensive per-block normalization step entirely. The second stage applies QJL, a technique based on the Johnson-Lindenstrauss transform, which reduces the small residual error from the first stage to a single sign bit per dimension. The combined result is a representation that uses most of its compression budget on capturing the original data's meaning and a minimal residual budget on error correction, with no overhead wasted on normalization constants. Google tested TurboQuant across five standard benchmarks for long-context language models, including LongBench, Needle in a Haystack, and ZeroSCROLLS, using open-source models from the Gemma, Mistral, and Llama families. At 3 bits, TurboQuant matched or outperformed KIVI, the current standard baseline for key-value cache quantization, which was published at ICML 2024. On needle-in-a-haystack retrieval tasks, which test whether a model can locate a single piece of information buried in a long passage, TurboQuant achieved perfect scores while compressing the cache by a factor of six. At 4-bit precision, the algorithm delivered up to an eight-times speedup in computing attention on Nvidia H100 GPUs compared to the uncompressed 32-bit baseline. The stock reaction was swift and, in the view of several analysts, disproportionate. Wells Fargo analyst Andrew Rocha noted that TurboQuant directly attacks the cost curve for memory in AI systems. If adopted broadly, he said, it quickly raises the question of how much memory capacity the industry actually needs. 
But Rocha and others also cautioned that the demand picture for AI memory remains strong, and that compression algorithms have existed for years without fundamentally altering procurement volumes. The concern is not unfounded, however. AI infrastructure spending is growing at extraordinary rates, with Meta alone committing up to $27 billion in a recent deal with Nebius for dedicated compute capacity, and Google, Microsoft, and Amazon collectively planning hundreds of billions in capital expenditure on data centres through 2026. A technology that reduces memory requirements by six times does not reduce spending by six times, because memory is only one component of a data centre's cost. But it changes the ratio, and in an industry spending at this scale, even marginal efficiency gains compound quickly. TurboQuant arrives at a moment when the AI industry is being forced to confront the economics of inference. Training a model is a one-time cost, however enormous. Running it, serving millions of queries per day with acceptable latency and accuracy, is the recurring expense that determines whether AI products are financially viable at scale. The key-value cache is central to this calculation: it is the bottleneck that limits how many concurrent users a single GPU can serve and how long a context window a model can practically support. Compression techniques like TurboQuant are part of a broader push toward making inference cheaper, alongside hardware improvements such as Nvidia's Vera Rubin architecture and Google's own Ironwood TPUs. The question is whether these efficiency gains will reduce the total amount of hardware the industry buys, or whether they will simply enable more ambitious deployments at roughly the same cost. The history of computing suggests the latter: when storage gets cheaper, people store more; when bandwidth increases, applications consume it. For Google, TurboQuant also has a direct commercial application beyond language models. 
The blog post notes that the algorithm improves vector search, the technology that powers semantic similarity lookups across billions of items. Google tested it against existing methods on the GloVe benchmark dataset and found it achieved superior recall ratios without requiring the large codebooks or dataset-specific tuning that competing approaches demand. This matters because vector search underpins everything from Google Search to YouTube recommendations to advertising targeting, which is to say, it underpins Google's revenue. The paper's contribution is real: a training-free compression method that achieves measurably better results than the existing state of the art, with strong theoretical foundations and practical implementation on production hardware. Whether it reshapes the economics of AI infrastructure or simply becomes one more optimisation absorbed into the industry's insatiable appetite for compute is a question the market will answer over months, not hours.
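To put the memory economics discussed above in rough perspective, here is a back-of-envelope sketch of KV-cache size at different precisions. The layer, head, and dimension numbers are hypothetical, chosen to resemble an 8B-class model rather than any specific published configuration; note that the raw bits ratio (16/3 ≈ 5.3x) slightly understates the headline "at least 6x" figure, which also credits eliminating the per-value quantization constants that other methods carry.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits):
    """Rough KV-cache size: keys + values (the factor of 2), one entry
    per layer, per KV head, per head dimension, per cached token."""
    return 2 * layers * kv_heads * head_dim * tokens * bits / 8

# Hypothetical 8B-class model shape (illustrative numbers only).
layers, kv_heads, head_dim = 32, 8, 128
tokens = 128_000  # a long context window

fp16_bytes = kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits=16)
q3_bytes = kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits=3)

print(f"16-bit cache: {fp16_bytes / 2**30:.2f} GiB")  # about 15.6 GiB
print(f" 3-bit cache: {q3_bytes / 2**30:.2f} GiB")    # about 2.9 GiB
```

A dozen gigabytes reclaimed per long-context session is GPU memory that can serve other users, which is exactly the ratio the analysts quoted above are recalculating.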
[4]
Google's new TurboQuant algorithm speeds up AI memory 8x, cutting costs by 50% or more
As Large Language Models (LLMs) expand their context windows to process massive documents and intricate conversations, they encounter a brutal hardware reality known as the "Key-Value (KV) cache bottleneck." Every word a model processes must be stored as a high-dimensional vector in high-speed memory. For long-form tasks, this "digital cheat sheet" swells rapidly, devouring the graphics processing unit (GPU) video random access memory (VRAM) used during inference and steadily dragging down model performance.

But have no fear, Google Research is here: yesterday, the unit within the search giant released its TurboQuant algorithm suite -- a software-only breakthrough that provides the mathematical blueprint for extreme KV cache compression, enabling a 6x reduction on average in the amount of KV memory a given model uses and an 8x performance increase in computing attention logits, which could reduce costs by more than 50% for enterprises that implement it on their models. The theoretically grounded algorithms and associated research papers are publicly available now for free, including for enterprise usage, offering a training-free solution to reduce model size without sacrificing intelligence.

The arrival of TurboQuant is the culmination of a multi-year research arc that began in 2024. While the underlying mathematical frameworks -- including PolarQuant and Quantized Johnson-Lindenstrauss (QJL) -- were documented in early 2025, their formal unveiling marks a transition from academic theory to large-scale production reality. The timing is strategic, coinciding with the presentations of these findings at the upcoming International Conference on Learning Representations (ICLR 2026) in Rio de Janeiro, Brazil, and the Annual Conference on Artificial Intelligence and Statistics (AISTATS 2026) in Tangier, Morocco.
By releasing these methodologies under an open research framework, Google is providing the essential "plumbing" for the burgeoning "Agentic AI" era: the need for massive, efficient, and searchable vectorized memory that can finally run on the hardware users already own. The release already appears to have affected the stock market, lowering the share prices of memory providers as traders read it as a sign that less memory will be needed (perhaps incorrectly, given Jevons' Paradox).

To understand why TurboQuant matters, one must first understand the "memory tax" of modern AI. Traditional vector quantization has historically been a "leaky" process. When high-precision decimals are compressed into simple integers, the resulting "quantization error" accumulates, eventually causing models to hallucinate or lose semantic coherence. Furthermore, most existing methods require "quantization constants" -- metadata stored alongside the compressed bits to tell the model how to decompress them. In many cases, these constants add so much overhead -- sometimes 1 to 2 bits per number -- that they negate the gains of compression entirely.

TurboQuant resolves this paradox through a two-stage mathematical shield. The first stage utilizes PolarQuant, which reimagines how we map high-dimensional space. Rather than using standard Cartesian coordinates (X, Y, Z), PolarQuant converts vectors into polar coordinates consisting of a radius and a set of angles. The breakthrough lies in the geometry: after a random rotation, the distribution of these angles becomes highly predictable and concentrated. Because the "shape" of the data is now known, the system no longer needs to store expensive normalization constants for every data block. It simply maps the data onto a fixed, circular grid, eliminating the overhead that traditional methods must carry.

The second stage acts as a mathematical error-checker. Even with the efficiency of PolarQuant, a residual amount of error remains.
TurboQuant applies a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform to this leftover data. By reducing each error value to a simple sign bit (+1 or -1), QJL serves as a zero-bias estimator. This ensures that when the model calculates an "attention score" -- the vital process of deciding which words in a prompt are most relevant -- the compressed version remains statistically identical to the high-precision original.

The true test of any compression algorithm is the "Needle-in-a-Haystack" benchmark, which evaluates whether an AI can find a single specific sentence hidden within 100,000 words. In testing across open-source models like Llama-3.1-8B and Mistral-7B, TurboQuant achieved perfect recall scores, mirroring the performance of uncompressed models while reducing the KV cache memory footprint by a factor of at least 6x. This "quality neutrality" is rare in the world of extreme quantization, where 3-bit systems usually suffer from significant logic degradation.

Beyond chatbots, TurboQuant is transformative for high-dimensional search. Modern search engines increasingly rely on "semantic search," comparing the meanings of billions of vectors rather than just matching keywords. TurboQuant consistently achieves superior recall ratios compared to existing state-of-the-art methods like RabbiQ and Product Quantization (PQ), all while requiring virtually zero indexing time. This makes it an ideal candidate for real-time applications where data is constantly being added to a database and must be searchable immediately. Furthermore, on NVIDIA H100 accelerators, TurboQuant's 4-bit implementation achieved an 8x performance boost in computing attention logits, a critical speedup for real-world deployments.

The reaction on X, obtained via a Grok search, included a mixture of technical awe and immediate practical experimentation.
The original announcement from @GoogleResearch generated massive engagement, with over 7.7 million views, signaling that the industry was hungry for a solution to the memory crisis. Within 24 hours of the release, community members began porting the algorithm to popular local AI libraries like MLX for Apple Silicon and llama.cpp. Technical analyst @Prince_Canuma shared one of the most compelling early benchmarks, implementing TurboQuant in MLX to test the Qwen3.5-35B model. Across context lengths ranging from 8.5K to 64K tokens, he reported a 100% exact match at every quantization level, noting that 2.5-bit TurboQuant reduced the KV cache by nearly 5x with zero accuracy loss. This real-world validation echoed Google's internal research, proving that the algorithm's benefits translate seamlessly to third-party models. Other users focused on the democratization of high-performance AI. @NoahEpstein_ provided a plain-English breakdown, arguing that TurboQuant significantly narrows the gap between free local AI and expensive cloud subscriptions. He noted that models running locally on consumer hardware like a Mac Mini "just got dramatically better," enabling 100,000-token conversations without the typical quality degradation. Similarly, @PrajwalTomar_ highlighted the security and speed benefits of running "insane AI models locally for free," expressing "huge respect" for Google's decision to share the research rather than keeping it proprietary. The release of TurboQuant has already begun to ripple through the broader tech economy. Following the announcement on Tuesday, analysts observed a downward trend in the stock prices of major memory suppliers, including Micron and Western Digital. The market's reaction reflects a realization that if AI giants can compress their memory requirements by a factor of six through software alone, the insatiable demand for High Bandwidth Memory (HBM) may be tempered by algorithmic efficiency. 
As we move deeper into 2026, the arrival of TurboQuant suggests that the next era of AI progress will be defined as much by mathematical elegance as by brute force. By redefining efficiency through extreme compression, Google is enabling "smarter memory movement" for multi-step agents and dense retrieval pipelines. The industry is shifting from a focus on "bigger models" to "better memory," a change that could lower AI serving costs globally.

For enterprises currently using or fine-tuning their own AI models, the release of TurboQuant offers a rare opportunity for immediate operational improvement. Unlike many AI breakthroughs that require costly retraining or specialized datasets, TurboQuant is training-free and data-oblivious. This means organizations can apply these quantization techniques to their existing fine-tuned models -- whether they are based on Llama, Mistral, or Google's own Gemma -- to realize immediate memory savings and speedups without risking the specialized performance they have worked to build.

From a practical standpoint, enterprise IT and DevOps teams should consider the following steps to integrate this research into their operations:

- Optimize Inference Pipelines: Integrating TurboQuant into production inference servers can reduce the number of GPUs required to serve long-context applications, potentially slashing cloud compute costs by 50% or more.

- Expand Context Capabilities: Enterprises working with massive internal documentation can now offer much longer context windows for retrieval-augmented generation (RAG) tasks without the massive VRAM overhead that previously made such features cost-prohibitive.

- Enhance Local Deployments: For organizations with strict data privacy requirements, TurboQuant makes it feasible to run highly capable, large-scale models on on-premise hardware or edge devices that were previously insufficient for 32-bit or even 8-bit model weights.

- Re-evaluate Hardware Procurement: Before investing in massive HBM-heavy GPU clusters, operations leaders should assess how much of their bottleneck can be resolved through these software-driven efficiency gains.

Ultimately, TurboQuant proves that the limit of AI isn't just how many transistors we can cram onto a chip, but how elegantly we can translate the infinite complexity of information into the finite space of a digital bit. For the enterprise, this is more than just a research paper; it is a tactical unlock that turns existing hardware into a significantly more powerful asset.
[5]
Google Shrinks AI Memory With No Accuracy Loss -- But There's a Catch - Decrypt
The method compresses inference memory, not model weights, and has only been tested in research benchmarks. Google Research published TurboQuant on Wednesday, a compression algorithm that shrinks a major inference-memory bottleneck by at least 6x while maintaining zero loss in accuracy. The paper is slated for presentation at ICLR 2026, and the reaction online was immediate. Cloudflare CEO Matthew Prince called it Google's DeepSeek moment. Memory stock prices, including Micron, Western Digital, and Seagate, fell on the same day.

Quantization efficiency is a big achievement by itself. But "zero accuracy loss" needs context. TurboQuant targets the KV cache -- the chunk of GPU memory that stores everything a language model needs to remember during a conversation. As context windows grow toward millions of tokens, those caches balloon into hundreds of gigabytes per session. That's the actual bottleneck. Not compute power but raw memory.

Traditional compression methods try to shrink those caches by rounding numbers down -- from 32-bit floats to 16, to 8, to 4-bit integers, for example. To better understand it, think of shrinking an image from 4K, to full HD, to 720p, and so on. It's easy to tell it's the same image overall, but there's more detail in 4K resolution. The catch: they have to store extra "quantization constants" alongside the compressed data to keep the model's outputs from degrading. Those constants add 1 to 2 bits per value, partially eroding the gains.

TurboQuant claims it eliminates that overhead entirely. It does this via two sub-algorithms. PolarQuant separates magnitude from direction in vectors, and QJL (Quantized Johnson-Lindenstrauss) takes the tiny residual error left over and reduces it to a single sign bit, positive or negative, with zero stored constants. The result, Google says, is a mathematically unbiased estimator for the attention calculations that drive transformer models.
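The sign-bit idea can be illustrated with the classic sign-of-random-projection estimator (often called SimHash): store only one bit per random direction, and the fraction of bits two vectors agree on still recovers the angle between them. This is a simplified cousin of QJL, shown for intuition, not the paper's exact construction.

```python
import math
import random

random.seed(0)  # deterministic demo

def sign_sketch(vec, projections):
    """Compress a vector to one sign bit per random direction --
    the 1-bit move at the heart of QJL-style sketching."""
    return [1 if sum(g * v for g, v in zip(row, vec)) >= 0 else -1
            for row in projections]

dim, m = 16, 4000  # m random projections -> m stored bits per vector
projections = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(m)]

x = [random.gauss(0, 1) for _ in range(dim)]
y = [xi + 0.1 * random.gauss(0, 1) for xi in x]  # y slightly perturbs x

sx, sy = sign_sketch(x, projections), sign_sketch(y, projections)
agreement = sum(a == b for a, b in zip(sx, sy)) / m
angle_estimate = math.pi * (1 - agreement)  # classic 1-bit angle estimator
# x and y are nearly parallel, so the estimated angle comes out small
# even though each sketch keeps only one bit per projection.
```

The estimator is unbiased in expectation, which is the property that matters for attention scores: errors average out rather than accumulating in one direction.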
In benchmarks using Gemma and Mistral, TurboQuant matched full-precision performance under 4x compression, including perfect retrieval accuracy on needle-in-haystack tasks up to 104,000 tokens. For context on why those benchmarks matter, expanding a model's usable context without quality loss has been one of the hardest problems in LLM deployment. Now, the fine print. "Zero accuracy loss" applies to KV cache compression during inference -- not to the model's weights. Compressing weights is a completely different, harder problem. TurboQuant doesn't touch those. What it compresses is the temporary memory storing mid-session attention computations, which is more forgiving because that data can theoretically be reconstructed. There's also the gap between a clean benchmark and a production system serving billions of requests. TurboQuant was tested on open-source models -- Gemma, Mistral, Llama -- not Google's own Gemini stack at scale. Unlike DeepSeek's efficiency gains, which required deep architectural decisions baked in from the start, TurboQuant requires no retraining or fine-tuning and claims negligible runtime overhead. In theory, it drops straight into existing inference pipelines. That's the part that spooked the memory hardware sector -- because if it works in production, every major AI lab runs leaner on the same GPUs they already own. The paper goes to ICLR 2026. Until it ships in production, the "zero loss" headline stays in the lab.
[6]
Google develops TurboQuant compression technology for AI models - SiliconANGLE
Google LLC has unveiled a technology called TurboQuant that can speed up artificial intelligence models and lower their memory requirements. Amir Zandieh and Vahab Mirrokni, two of the researchers who worked on the project, explained how it works in a Tuesday blog post.

One way to speed up AI models is to reduce the amount of data they must process to make decisions. That can be achieved by compressing the input data that a model ingests. There are many algorithms that can compress AI models' input data, but they often provide only limited efficiency improvements. Additionally, they can introduce errors into the data they compress, which lowers AI models' output quality. According to Google, TurboQuant can not only compress AI models' data more efficiently than existing algorithms but also do so with fewer errors. It does so by changing the data's mathematical properties.

AI models represent the data they process in the form of vectors. A vector is a geometric object that is often visualized as a simple two-dimensional line. The line has two main properties: length and direction. An arrow indicates the direction of the line. In practice, advanced AI models store data using not simple two-dimensional lines but so-called high-dimensional vectors, which can span hundreds or thousands of dimensions rather than just two. A high-dimensional vector can store a piece of data such as a sentence or an equation.

The fact that vectors have a direction means that they can be rotated, in an abstract sense of the word. TurboQuant harnesses that property to optimize AI models' data. According to Google, it uses an approach called random preconditioning to rotate an AI model's vectors in a way that makes them easier to compress. It then compresses them with an algorithm called a quantizer.
The primary benefit of rotating vectors is that it shields them from data errors during the compression process. However, a small number of errors still find their way into the vectors. TurboQuant fixes those inaccuracies using an algorithm called QJL. "QJL uses a mathematical technique called the Johnson-Lindenstrauss Transform to shrink complex, high-dimensional data while preserving the essential distances and relationships between data points," Zandieh and Mirrokni explained. "This algorithm essentially creates a high-speed shorthand that requires zero memory overhead."
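The distance-preserving behaviour Zandieh and Mirrokni describe can be demonstrated with a plain random projection: squeeze high-dimensional vectors into far fewer dimensions, and the distance between them barely moves. This is a toy sketch of the Johnson-Lindenstrauss idea, not QJL itself, which additionally quantizes the projected values.

```python
import math
import random

random.seed(1)  # deterministic demo

d, k = 512, 128  # squeeze 512-dim vectors into 128 dims
# Scaled random Gaussian matrix: a Johnson-Lindenstrauss projection.
proj = [[random.gauss(0, 1) / math.sqrt(k) for _ in range(d)]
        for _ in range(k)]

def jl_project(vec):
    """Project a d-dim vector down to k dims."""
    return [sum(r * v for r, v in zip(row, vec)) for row in proj]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

x = [random.gauss(0, 1) for _ in range(d)]
y = [random.gauss(0, 1) for _ in range(d)]

ratio = dist(jl_project(x), jl_project(y)) / dist(x, y)
# ratio stays close to 1.0: the 4x smaller representation preserves
# the distance between the two points to within roughly ten percent.
```

This is the "high-speed shorthand" intuition: relationships between points survive even though most of the coordinates are gone.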
[7]
Google's TurboQuant explained: The JPEG approach to AI compression
How do you make sense of Google's TurboQuant tech, especially if you're not a cutting-edge tech pro? The tech behind what Google's trying to do seems so impactful, but what good is it if it doesn't make sense, right? Connecting it to the tech powering the images we see on a daily basis seems like a good place to start.

Think about what happens when you save a photo as a JPEG. The file trims away details that your eyes won't notice anyway: tiny variations in color, subtle textures, things that don't really change how the image looks to you. The result still looks the same to you, but the file size drops massively. The real trick isn't what it keeps; it's knowing what it can safely throw away.

That's what TurboQuant does, too, on a very different scale. When an AI model processes a long conversation or a large document, it stores everything in its working memory as a huge grid of numbers. These numbers are extremely precise, and that precision comes at a cost. More memory means more computing power, more energy, and ultimately higher costs.

What TurboQuant does is surprisingly simple in concept. It asks the same question as JPEG: how much of this detail actually matters? Instead of keeping everything at high precision, it compresses those numbers. We're talking about shrinking them from 32-bit precision down to just 3 or 4 bits. Which, when you say it out loud, sounds like a huge loss that could break everything. However, there is nuance to it. It adds a tiny correction layer, just one extra bit, to fix any important errors that might creep in.

The result is kind of wild. Memory usage drops by up to six times. Processing becomes significantly faster. And somehow, the model still performs almost exactly the same. What I find most interesting isn't just the efficiency gains.
It's how familiar the idea feels. This isn't some completely alien breakthrough. It's a principle we've been using for decades. JPEG did it for images back in the early 90s. TurboQuant is doing it for AI today. Progress in tech doesn't always come from adding more. Oftentimes, it comes from knowing what you can afford to lose.
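The basic trade described above, giving up precision to save memory, can be sketched in a few lines of Python. This is a toy uniform quantizer for illustration, not Google's actual algorithm:

```python
import numpy as np

def quantize_4bit(x, lo=-1.0, hi=1.0):
    """Map floats in [lo, hi] to 4-bit integer codes (16 levels)."""
    return np.round((x - lo) / (hi - lo) * 15).astype(np.uint8)

def dequantize_4bit(codes, lo=-1.0, hi=1.0):
    """Reconstruct approximate floats from the 4-bit codes."""
    return codes / 15.0 * (hi - lo) + lo

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1024).astype(np.float32)  # "32-bit precision"
x_hat = dequantize_4bit(quantize_4bit(x))                  # 4-bit round trip

# Each value now needs 4 bits instead of 32 (an 8x raw reduction),
# and the worst-case error is half a quantization step: 1/15, about 0.067.
max_err = np.max(np.abs(x - x_hat))
print(f"max reconstruction error: {max_err:.4f}")
```

The point of the sketch is the JPEG-style bargain: the codes take an eighth of the space, and the error is bounded and small. TurboQuant's extra correction bit (discussed later in this piece) exists to claw back the part of that error that actually matters.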
Google Research unveiled TurboQuant, a compression algorithm that reduces the memory footprint of large language models by at least 6x while delivering an 8x performance boost on Nvidia H100 GPUs. The breakthrough targets the key-value cache bottleneck without sacrificing accuracy, yet memory chip stocks, including Micron and Western Digital, dropped as investors reconsidered future demand for AI hardware.
Google Research published TurboQuant this week, a compression algorithm designed to address one of the most expensive challenges in running large language models: the escalating memory demands of the key-value cache [1]. This digital storage system holds context information so models don't have to recompute data with every token generated, but as context windows expand, the cache consumes massive amounts of GPU memory that could otherwise serve more users or run larger models [3]. TurboQuant compresses the cache to just 3 bits per value, down from the standard 16, achieving at least a 6x reduction in memory usage [2]. In benchmarks on Nvidia H100 GPUs, 4-bit TurboQuant delivered up to an 8x performance boost in computing attention logits compared with unquantized 32-bit keys [2].
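To get a feel for why the KV cache matters, here is some back-of-the-envelope arithmetic for a hypothetical long-context model. The layer count, head count, and head dimension below are illustrative assumptions, not figures from the paper:

```python
# Hypothetical model configuration (illustrative assumptions, not from the paper)
layers, kv_heads, head_dim = 32, 8, 128
seq_len = 32_768          # tokens of context held in the cache
kv_pair = 2               # one key and one value per head per token

def kv_cache_bytes(bits_per_value):
    """Total KV-cache size for a given per-value bit width."""
    values = layers * kv_heads * head_dim * seq_len * kv_pair
    return values * bits_per_value // 8

fp16_bytes = kv_cache_bytes(16)   # standard 16-bit cache
tq_bytes = kv_cache_bytes(3)      # 3-bit cache, ignoring the 1-bit correction

print(f"fp16 cache:  {fp16_bytes / 2**30:.2f} GiB")   # 4.00 GiB
print(f"3-bit cache: {tq_bytes / 2**30:.2f} GiB")     # 0.75 GiB
print(f"raw ratio:   {fp16_bytes / tq_bytes:.1f}x")   # 5.3x
```

Note that the raw 16-to-3-bit ratio is only about 5.3x; the "at least 6x" headline figure presumably also reflects eliminating the per-block normalization constants that competing schemes must store alongside the compressed data, as described below.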
The innovation behind TurboQuant lies in eliminating the overhead that makes most compression techniques less effective than their headline numbers suggest [3]. Traditional vector quantization methods reduce the size of data vectors but must store additional normalization constants alongside the compressed data, typically adding one or two extra bits per number and partially undoing the compression gains [4]. TurboQuant avoids this through a two-stage process developed by research scientist Amir Zandieh and VP Vahab Mirrokni, along with collaborators at Google DeepMind, KAIST, and New York University [3]. The first stage, called PolarQuant, converts data vectors from standard Cartesian coordinates into polar coordinates, separating each vector into a radius representing magnitude and a set of angles representing direction [4]. Because the angular distributions follow predictable, concentrated patterns after a random rotation, the system skips expensive per-block normalization steps entirely [2].
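The polar-coordinate idea can be illustrated with a toy encoder that splits a vector into 2-D pairs and stores each as a radius plus a coarsely quantized angle. This is a sketch of the general principle, not Google's actual PolarQuant codebook:

```python
import numpy as np

ANGLE_BITS = 3                # 8 angle buckets per 2-D pair
LEVELS = 2 ** ANGLE_BITS

def polar_encode(v):
    """Split a vector into (x, y) pairs; store (radius, quantized angle)."""
    pairs = v.reshape(-1, 2)
    radii = np.hypot(pairs[:, 0], pairs[:, 1])
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])          # in (-pi, pi]
    codes = np.round((theta + np.pi) / (2 * np.pi) * LEVELS) % LEVELS
    return radii, codes.astype(np.uint8)

def polar_decode(radii, codes):
    """Reconstruct an approximate vector from radii and angle codes."""
    theta = codes / LEVELS * 2 * np.pi - np.pi
    pairs = np.stack([radii * np.cos(theta), radii * np.sin(theta)], axis=1)
    return pairs.reshape(-1)

rng = np.random.default_rng(0)
v = rng.normal(size=128)
v_hat = polar_decode(*polar_encode(v))

# With 3-bit angles the angular error is at most pi/8, so the relative
# reconstruction error per pair is bounded by 2*sin(pi/16), about 0.39,
# while every radius (the "data strength") is preserved exactly.
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Separating magnitude from direction this way is what lets the angle codes live in a fixed, known range, which is the property the article says removes the need for stored normalization constants.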
The second stage applies a 1-bit error correction layer using an algorithm called Quantized Johnson-Lindenstrauss, or QJL [1]. QJL projects the residual quantization error from PolarQuant into a lower-dimensional space and reduces each value to a single sign bit, either +1 or -1, while preserving the essential vector data that describes relationships [1]. This serves as a zero-bias estimator that eliminates systematic bias in attention score calculations at negligible additional cost [2]. The result is a more accurate attention score, the fundamental process by which neural networks decide which data is important when processing queries [1].
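The sign-bit trick can be demonstrated with the standard identity E[sign(⟨g,k⟩)·⟨g,q⟩] = √(2/π)·⟨k,q⟩ for unit vectors and Gaussian g, which is what makes such an estimator unbiased. The sketch below illustrates that principle on two synthetic vectors; it is not Google's QJL implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
d, m = 64, 8192                # original and projected dimensions

# Two unit vectors with a known inner product of exactly 0.8.
k = np.zeros(d); k[0] = 1.0
q = np.zeros(d); q[0] = 0.8; q[1] = np.sqrt(1 - 0.8 ** 2)

S = rng.normal(size=(m, d))    # random Gaussian projection

k_bits = np.sign(S @ k)        # "keys" stored as one sign bit per projection
q_proj = S @ q                 # "queries" kept at full precision

# Unbiased inner-product estimate from sign bits alone.
est = np.sqrt(np.pi / 2) * np.mean(k_bits * q_proj)
print(f"true inner product 0.800, estimated {est:.3f}")
```

Even though each projected key value has been crushed to a single bit, the inner product, which is exactly what attention scores are built from, is recovered with no systematic bias, only small random noise that shrinks as the projection dimension grows.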
Google tested TurboQuant across five standard benchmarks for long-context language models, including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, using open-source models from the Gemma, Mistral, and Llama families [2]. TurboQuant achieved perfect downstream scores on needle-in-a-haystack retrieval tasks while compressing KV memory by at least six times [2]. On the LongBench suite, which covers question answering, code generation, and summarization, TurboQuant matched or outperformed the KIVI baseline across all tasks [2]. The algorithm requires no training or fine-tuning and incurs negligible runtime overhead, making it suitable for deployment in production inference and large-scale vector search systems [2]. Importantly, TurboQuant compresses the KV cache at inference time; it does not touch the model's weights, which pose a completely different compression challenge [5].
Within hours of Google's announcement, memory chip stocks fell as investors recalculated how much physical memory the AI industry might actually need [3]. Micron dropped 3 percent, Western Digital lost 4.7 percent, and SanDisk fell 5.7 percent [3]. Wells Fargo analyst Andrew Rocha noted that TurboQuant directly attacks the cost curve for AI memory systems, quickly raising questions about actual capacity requirements [3]. However, analysts cautioned that the demand picture for AI memory remains strong, and compression algorithms have existed for years without fundamentally altering procurement volumes [3]. The paper will be presented at ICLR 2026 next month, with related work on PolarQuant appearing at AISTATS 2026 [2].
TurboQuant arrives as the AI industry confronts the economics of inference, where serving millions of queries per day with acceptable latency determines whether AI products are financially viable at scale [3]. The KV cache is the bottleneck that limits how many concurrent users a single GPU can serve and how long a context window a model can practically support [3]. If implemented, TurboQuant could reduce serving costs by more than 50 percent for enterprises that deploy it [4]. The algorithm can quantize the cache to just 3 bits with no additional training, so it can be applied to existing models without architectural changes [1]. Mobile AI could see particular benefit, as hardware limitations on smartphones make compression techniques like TurboQuant valuable for improving output quality without sending data to the cloud [1]. Whether the clean benchmark results carry over to production systems serving billions of requests remains to be seen, as TurboQuant was tested on open-source models rather than Google's own Gemini stack at scale [5].