DeepSeek's Engram tackles AI compute waste with conditional memory breakthrough

Reviewed by Nidhi Govil

DeepSeek unveiled Engram, a conditional memory technique that separates static information retrieval from complex reasoning tasks. The method allows AI models to bypass GPU memory constraints by committing knowledge to system RAM, reducing reliance on expensive high-bandwidth memory while improving performance on long-context queries.

DeepSeek introduces conditional memory to address AI efficiency challenges

DeepSeek has released a technical paper detailing Engram, a conditional memory-based approach that fundamentally changes how AI models handle knowledge retrieval and reasoning. Co-authored by DeepSeek CEO Liang Wenfeng, the research tackles a critical inefficiency in Transformer models: the wasteful use of GPU cycles for simple lookups that could be handled through direct memory access [1][3]. When enterprise AI systems retrieve basic information like product names or technical specifications, they currently use expensive GPU computation designed for complex reasoning tasks, wasting cycles millions of times per day and inflating infrastructure costs [3].

The research arrives as organizations face mounting pressure to deploy more capable AI systems while navigating GPU memory constraints and rising hardware costs. This HBM bottleneck is widely recognized as a key reason DRAM prices rose by 5X in just 10 weeks, as hardware demand spiked to support large AI models [2]. For Chinese AI labs operating under US export controls on GPUs, Engram offers a potential path forward by optimizing algorithmic efficiency rather than relying on brute-force compute scaling [4].

How Engram separates static lookups from dynamic reasoning

Engram works by decoupling compute power from memory storage, allowing models to look up essential information efficiently without overloading GPU memory [2]. The system uses N-grams (statistical sequences of words) integrated into the model's neural networks and placed into a queryable memory bank [1]. This enables the model to remember facts rather than having to reason them out through expensive neural computation each time.

Source: Geeky Gadgets
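As a rough sketch of that idea, the toy code below keeps a flat table of n-gram embeddings in ordinary host RAM; the slot count, embedding width, n-gram order, and hashing scheme are all illustrative assumptions rather than details taken from DeepSeek's paper.

```python
import numpy as np

# Illustrative sizes only -- the actual table dimensions are not given in the coverage.
NUM_SLOTS = 1 << 20      # number of memory slots held in system RAM
EMBED_DIM = 256          # embedding width per slot
NGRAM_ORDER = 2          # length of the token n-grams used as keys

def slot_for(ngram: tuple[int, ...]) -> int:
    """Map a token n-gram to a slot index (Python's hash as a stand-in scheme)."""
    return hash(ngram) % NUM_SLOTS

# The memory bank: a flat table of embeddings kept in host RAM rather than HBM.
# In a real system these vectors would be learned during training; random
# values stand in for them here.
memory_bank = np.random.randn(NUM_SLOTS, EMBED_DIM).astype(np.float32)

def write_ngram(ngram: tuple[int, ...], embedding: np.ndarray) -> None:
    """Commit (or overwrite) the embedding associated with one n-gram."""
    memory_bank[slot_for(ngram)] = embedding

write_ngram((42, 7), np.ones(EMBED_DIM, dtype=np.float32))
```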

The mechanism relies on hash-based lookups for static information retrieval. Token combinations extracted from the input text are hashed to retrieve embeddings from a pre-trained lookup table stored in system RAM [5]. A standard Mixture of Experts (MoE) model must effectively reconstruct these pieces of data through conditional computation every time they are referenced in a query, activating expert parameters to assemble and reason over the data [1]. Engram instead lets the model ask "Do I already have this data?" before engaging the parts of the network devoted to reasoning.
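A minimal sketch of that query path, under the same toy assumptions as above (fixed n-gram order, Python's built-in hash standing in for whatever hash family the paper actually uses):

```python
import numpy as np

NUM_SLOTS, EMBED_DIM, NGRAM_ORDER = 1 << 20, 256, 2                      # illustrative sizes
memory_bank = np.random.randn(NUM_SLOTS, EMBED_DIM).astype(np.float32)   # stand-in table

def lookup_ngrams(token_ids: list[int]) -> np.ndarray:
    """Hash each sliding n-gram of the input and fetch its stored embedding.

    Each fetch is a constant-time table read -- no expert routing and no
    matrix multiplication -- which is what makes the lookup cheap.
    """
    rows = []
    for i in range(len(token_ids) - NGRAM_ORDER + 1):
        ngram = tuple(token_ids[i : i + NGRAM_ORDER])
        slot = hash(ngram) % NUM_SLOTS           # deterministic slot index
        rows.append(memory_bank[slot])
    return np.stack(rows)                        # (num_ngrams, EMBED_DIM)

retrieved = lookup_ngrams([17, 42, 42, 7])       # toy token ids
print(retrieved.shape)                           # (3, 256)
```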

To prevent errors, Engram employs a context-aware gating mechanism that filters retrieved patterns. A hash lookup for "Apple" might collide with unrelated content, or the word might refer to the fruit rather than the company [3]. The model's current understanding of the context acts as a filter: if retrieved memory contradicts the current context, the gate suppresses it; if it fits, the gate lets it through. This design lets models handle long contexts more efficiently and supports system-level prefetching with minimal performance overhead [2].
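The coverage does not spell out the gate's exact form; as a hedged sketch, one plausible reading is a learned score between each retrieved embedding and the model's current hidden state, squashed through a sigmoid so that contradictory retrievals are scaled toward zero:

```python
import numpy as np

def gated_memory(retrieved: np.ndarray, hidden_state: np.ndarray,
                 w_gate: np.ndarray) -> np.ndarray:
    """Context-aware gate (an illustrative form, not DeepSeek's exact equation).

    retrieved:    (num_ngrams, d) embeddings pulled from the memory bank
    hidden_state: (d,)            the model's current contextual representation
    w_gate:       (d, d)          a learned projection assumed here for the gate score
    """
    scores = retrieved @ (w_gate @ hidden_state)   # agreement with the current context
    gate = 1.0 / (1.0 + np.exp(-scores))           # sigmoid in [0, 1]
    # Low-scoring (contradictory) retrievals are suppressed; consistent ones pass.
    return gate[:, None] * retrieved

d = 256
out = gated_memory(np.random.randn(4, d), np.random.randn(d), np.random.randn(d, d))
print(out.shape)   # (4, 256)
```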

Optimal parameter allocation reduces computational waste

Through systematic experiments, DeepSeek discovered the optimal balance between computation and memory: 75-80% of sparse model capacity allocated to dynamic reasoning tasks and 20-25% to static lookups [3]. Tests showed that reallocating around 20-25% of the parameter budget to Engram yields better performance than pure MoE models, maintaining stable gains across different scales [2].
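For a concrete sense of what that split means at the scale tested, a back-of-the-envelope calculation (the 27B figure and the ratios come from the article; everything else is plain arithmetic):

```python
# Back-of-the-envelope split of a 27B-parameter sparse model using the
# 75-80% / 20-25% allocation reported above.
total_params = 27e9

for memory_share in (0.20, 0.25):
    reasoning = total_params * (1 - memory_share)
    memory = total_params * memory_share
    print(f"memory share {memory_share:.0%}: "
          f"{reasoning / 1e9:.1f}B parameters for MoE reasoning, "
          f"{memory / 1e9:.1f}B of capacity in the Engram lookup table")
```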

Source: VentureBeat

An Engram-based model scaled to nearly 27 billion parameters demonstrated measurable improvements across standard industry benchmarks [1]. Complex reasoning benchmarks jumped from 70% to 74% accuracy, while knowledge-focused tests improved from 57% to 61%, with gains measured across Big-Bench Hard, ARC-Challenge, and MMLU [3]. Memory slot expansion provides predictable improvements without additional computational cost, confirming the scalability of conditional memory as an independent axis for sparse models [2].

Infrastructure implications and hardware efficiency gains

Engram's deterministic retrieval mechanism allows memory capacity to scale linearly across multiple GPUs while supporting asynchronous prefetching during inference [2]. The approach works with existing GPU and system memory architectures, potentially avoiding costly high-bandwidth memory (HBM) upgrades. It also aligns with emerging CXL (Compute Express Link) standards, which aim to overcome GPU memory constraints in large-scale AI workloads [2].
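Because the slot index is a pure function of the token ids, a serving stack could in principle hash upcoming tokens and start pulling their embeddings out of host RAM before the GPU needs them. The sketch below illustrates that idea with a thread pool; the prefetcher design is an assumption for illustration, not a description of DeepSeek's system:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

NUM_SLOTS, EMBED_DIM, NGRAM_ORDER = 1 << 20, 256, 2                      # illustrative sizes
memory_bank = np.random.randn(NUM_SLOTS, EMBED_DIM).astype(np.float32)   # lives in host RAM

def slots_for(token_ids: list[int]) -> list[int]:
    """Deterministic slot indices for every n-gram in the prompt."""
    return [hash(tuple(token_ids[i:i + NGRAM_ORDER])) % NUM_SLOTS
            for i in range(len(token_ids) - NGRAM_ORDER + 1)]

executor = ThreadPoolExecutor(max_workers=2)

def prefetch(token_ids: list[int]):
    """Start gathering the needed rows off the critical path; returns a future."""
    slots = slots_for(token_ids)
    return executor.submit(lambda: memory_bank[slots])

future = prefetch([17, 42, 42, 7, 99])   # kick off while earlier layers run on the GPU
embeddings = future.result()             # ready by the time the Engram layer needs them
print(embeddings.shape)                  # (4, 256)
```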

The technique may relieve pressure on expensive memory hardware, particularly in regions whose HBM supply lags behind market leaders such as Samsung, SK Hynix, and Micron [2]. Engram differs from solutions like Nvidia's KVCache offloading, which moves context data to NVMe storage with BlueField-4 [1]. While a KV cache acts as short-term memory for recent conversations, Engram provides persistent access to pre-calculated data, essentially storing the whole encyclopedia rather than just handwritten notes.

Chris Latimer, founder and CEO of Vectorize, notes that conditional memory solves a different problem than agentic AI memory systems: "It's more geared towards squeezing performance out of smaller models and getting more mileage out of scarce GPU resources" [3]. Early validation suggests models can expand parameter scale and reasoning capacity while managing memory demands more efficiently, potentially damping sharp DDR5 DRAM price swings [2]. Organizations should monitor how Engram performs in production deployments and whether its approach to reducing computational waste becomes standard practice for optimizing AI infrastructure costs.
