3 Sources
[1]
Deepseek research touts memory breakthrough, decoupling compute power and RAM pools to bypass GPU & HBM constraints -- Engram conditional memory module commits static knowledge to system RAM
DeepSeek has released a new technical paper detailing how future AI models might rely on a queryable database of information committed to system memory. Named "Engram", the conditional memory technique achieves demonstrably higher performance in long-context queries by committing sequences of data to static memory. This eases the model's reliance on reasoning, leaving the GPU to handle only the more complex tasks, which increases performance and reduces dependence on high-bandwidth memory (HBM). The paper details how N-grams, statistical sequences of words, are integrated into the model's neural networks, allowing them to be placed into a queryable memory bank. Engram allows models to remember facts rather than having to reason them out, which is more computationally expensive.

Released on the company's GitHub page, Engram aims to curb the reliance on more complex memory types and instead commit a knowledge library to a more common system memory standard, such as CXL. The ongoing reliance on high-bandwidth memory for AI accelerators is something that even Chinese silicon, such as Huawei's Ascend series, cannot escape. Each stack of HBM uses more memory dies, and with demand skyrocketing, easing any AI model's reliance on the GPU's direct high-bandwidth memory would be significant, especially considering the ongoing memory supply squeeze. Engram would enable static memory to be held separately from an LLM's compute, allowing the GPU's rapid HBM to dedicate itself to reasoning and therefore enabling more performant Engram-based AI models compared to a standard Mixture of Experts (MoE) model.

As detailed in the paper, an Engram-based model scaled to nearly 27 billion parameters can beat a standard MoE model in long-context training and eliminates the computational waste of having to reason out facts, by allowing them to be stored externally. A standard MoE model relies on conditional computation: it must reconstruct these pieces of data every time they are referenced in a query, calling on its expert parameters to assemble and reason over the data even when the query only activates certain parts or experts (sparse computation). The Engram paper adds that conditional memory would let the model simply ask, "Do I already have this data?", rather than having to access the parts of the model that deal with reasoning. "This process essentially amounts to an expensive runtime reconstruction of a static lookup table, wasting valuable sequential depth on trivial operations that could otherwise be allocated to higher-level reasoning," the paper reads. Engram takes static patterns and lists its knowledge index as a parsable piece of conditional memory with a store of information, relieving the AI model of the burden of having to reason through context repeatedly.

While Nvidia's KVCache, announced at CES 2026, offloads context data to NVMe memory with BlueField-4, it acts as more of a short-term solution: it lets the model remember things you have recently said or added within context, and is, for all intents and purposes, disposable once you move on to the next query or conversation. KVCache, while persistent within the history of your conversations or queries, does not draw on an existing base of pre-calculated data, and is not persistent in the same way that Engram-based LLMs could be, if the paper is to be believed.
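To make the lookup-versus-recompute distinction concrete, here is a minimal Python sketch of the "Do I already have this data?" idea described above. The table contents, dimensions, and function names are illustrative assumptions, not DeepSeek's actual Engram implementation.

```python
# Toy sketch of conditional memory as an O(1) n-gram lookup (illustrative only;
# not DeepSeek's released code). The dictionary stands in for a store of
# precomputed static knowledge held in system RAM rather than GPU HBM.
import numpy as np

EMBED_DIM = 64

# Hypothetical precomputed entries keyed by short n-grams.
ngram_memory = {
    ("universal", "studios"): np.random.rand(EMBED_DIM),
    ("princess", "of", "wales"): np.random.rand(EMBED_DIM),
}

def recompute_with_model(tokens):
    # Stand-in for the expensive path: running attention/FFN layers on the GPU
    # to reconstruct a static fact from scratch.
    return np.random.rand(EMBED_DIM)

def retrieve_or_compute(tokens):
    """Ask 'Do I already have this data?' before falling back to reasoning."""
    key = tuple(t.lower() for t in tokens)
    cached = ngram_memory.get(key)          # constant-time hash-table read
    if cached is not None:
        return cached, "memory"
    return recompute_with_model(tokens), "compute"

_, path = retrieve_or_compute(["Universal", "Studios"])
print(path)  # -> "memory"
```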
To put it simply, KVCache can be likened to storing your handwritten notes, whereas Engram is a record of the whole encyclopedia. This is enabled through tokenizer compression, which treats equivalent tokens (such as the same word with different capitalization) as the same canonical concept. This allowed DeepSeek to reduce the vocabulary size for the conditional memory module by 23% and allows for rapid parsing of information in context. Because there is an impossibly large number of phrases or word combinations within a given context, the team employs hashing, which assigns a number to a series of words. Engram builds on this with what it calls Multi-Head Hashing, in which a single phrase is mapped to several hash values to avoid erroneously attaching the wrong context. For example, "Universal" might be a single entry, distinct from "Universal Studios", with Multi-Head Hashing employed to ensure no mistakes or database errors. The result is then passed to Engram's context-aware gating, which confirms that the term matches the context of the sentence it's being used in before it's deployed into an output.

To examine how Engram-based LLMs might work in large-scale deployments, DeepSeek detailed how it might achieve the best allocation between Engram embeddings and MoE parameters within an AI model. The outcome was a U-curve, which showed that memory and compute (or reasoning) can be considered mathematically distinct forms of intelligence within AI models, and revealed a sweet spot between MoE and Engram embeddings. "Remarkably, the Engram model achieves comparable performance to the pure MoE baseline (π = 100%) even when the MoE allocation is reduced to just π ≈ 40% (i.e., a total of 46 experts for the 5.7B model and 43 experts for the 9.9B model). Furthermore, the pure MoE baseline proves suboptimal: reallocating roughly 20%-25% of the sparse parameter budget to Engram yields the best performance." DeepSeek itself remarks that both Engram-dominated and MoE-dominated models falter, whereas a ratio that gives 20-25% of the model's overall parameter budget to Engram achieves the best results.

DeepSeek ran another experiment in parallel, which it names the "Infinite Memory Regime." This keeps the computational budget fixed, so the model doesn't get more expensive to run, while attaching a near-infinite number of conditional memory parameters deployed through Engram. Because Engram is distinct from the overall compute budget (it's effectively a long-term storage bank that the model taps into), DeepSeek found that performance scales linearly with memory size: if a model keeps adding to its conditional memory banks, its performance continues to improve without increasing the overall compute budget. This could have significant implications for the wider AI industry if performance is not bound solely by compute, but also by long-term "Engram" memory banks. If the benefits are as good as the paper outlines, the memory squeeze would no longer rest solely on the deployment of HBM, but on all forms of memory that could be deployed within data centers, whether through CXL or other methods of interconnection.
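As a rough illustration of how multi-head hashing and context-aware gating could fit together as described above, here is a short Python sketch. The table size, number of hash heads, and the cosine-similarity gate are assumptions made for the example, not DeepSeek's published design.

```python
# Illustrative sketch of multi-head hashing plus a context gate. Table size,
# hash-head count, and the gate form are assumptions, not DeepSeek's values.
import hashlib
import numpy as np

TABLE_SIZE = 1_000_003   # hypothetical embedding-table size
NUM_HEADS = 4            # hypothetical number of hash heads
EMBED_DIM = 64

rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((TABLE_SIZE, EMBED_DIM)).astype(np.float32)

def canonicalize(token):
    # Tokenizer compression in spirit: fold case so "Universal" and "universal"
    # map to the same canonical entry.
    return token.lower()

def multi_head_hash(ngram):
    # One table index per head; a phrase is identified by the set of its
    # indices, so a single colliding slot is unlikely to pull in a wrong entry.
    key = " ".join(canonicalize(t) for t in ngram)
    return [
        int(hashlib.sha256(f"{h}:{key}".encode()).hexdigest(), 16) % TABLE_SIZE
        for h in range(NUM_HEADS)
    ]

def retrieve(ngram, context_state):
    slots = multi_head_hash(ngram)
    memory_vec = embedding_table[slots].mean(axis=0)   # combine the heads
    # Context-aware gating: keep the retrieved vector only if it agrees with
    # the model's current hidden state (cosine similarity as a toy gate).
    sim = memory_vec @ context_state / (
        np.linalg.norm(memory_vec) * np.linalg.norm(context_state) + 1e-8
    )
    gate = 1.0 / (1.0 + np.exp(-sim))                  # squash to (0, 1)
    return gate * memory_vec

context = rng.standard_normal(EMBED_DIM).astype(np.float32)
out = retrieve(("Universal", "Studios"), context)
```

A real implementation would learn the gate jointly with the model; the point here is only that several hash heads plus a context check let a phrase be retrieved without colliding entries overriding the surrounding sentence.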
DeepSeek deployed a 27B-parameter Engram model and a standard 27B MoE model in parallel to determine the performance benefits of conditional memory within AI models, and the results were exemplary. On knowledge-intensive tasks, Engram was 3.4 to 4 points better than its MoE equivalent, and it was even better at reasoning, with a 3.7 to 5 point uplift over its "reasoning-only" MoE sibling. Similar results were achieved in coding and mathematics-based tests. The big win for Engram, however, was in long-context tasks, lifting accuracy on the NIAH (Needle in a Haystack) benchmark to 97%, a leap from the MoE model's score of 84.2%. This is a large difference in reliability between the models, and could point toward AI's long-context and coherence issues eventually becoming a thing of the past if Engram were deployed in a commercial AI model, especially as demand for long-context AI queries increases.

Engram has significant implications for the AI industry, especially as the paper details how this methodology is no longer bound by HBM but by longer-term storage. System DRAM could be utilized to significantly improve the quality of Engram-based LLM outputs, meaning the much more expensive HBM would only be used for computationally heavy queries. Of course, if Engram were to take off, it could worsen the ongoing DRAM supply crisis, as AI hyperscalers adopting the methodology would flock to system DRAM instead of focusing solely on putting all of their memory ICs into HBM for GPUs.

"We envision conditional memory functions as an indispensable modeling primitive for next-generation sparse models," DeepSeek said, hinting at a possible V4 deploying Engram in a new AI model. With the company rumored to announce a new AI model within the next few weeks, don't be surprised if it implements Engram. While the results are impressive on paper, Engram's impact has yet to be determined in real-world deployment. But if everything the paper says holds in a real-world context, the company could be onto a new 'DeepSeek moment.'
[2]
DeepSeek's conditional memory fixes silent LLM waste: GPU cycles lost to static lookups
When an enterprise LLM retrieves a product name, technical specification, or standard contract clause, it's using expensive GPU computation designed for complex reasoning -- just to access static information. This happens millions of times per day. Each lookup wastes cycles and inflates infrastructure costs. DeepSeek's newly released research on "conditional memory" addresses this architectural limitation directly. The work introduces Engram, a module that separates static pattern retrieval from dynamic reasoning. It delivers results that challenge assumptions about what memory is actually for in neural networks. The paper was co-authored by DeepSeek founder Liang Wenfeng.

Through systematic experiments, DeepSeek found the optimal balance between computation and memory: 75% of sparse model capacity allocated to dynamic reasoning and 25% to static lookups. This memory system improved reasoning more than knowledge retrieval. Complex reasoning benchmarks jumped from 70% to 74% accuracy, while knowledge-focused tests improved from 57% to 61%. These improvements came from tests including Big-Bench Hard, ARC-Challenge, and MMLU. The research arrives as enterprises face mounting pressure to deploy more capable AI systems while navigating GPU memory constraints and infrastructure costs. DeepSeek's approach offers a potential path forward by fundamentally rethinking how models should be structured.

How conditional memory solves a different issue than agentic memory and RAG

Agentic memory systems, sometimes referred to as contextual memory -- like Hindsight, MemOS, or Memp -- focus on episodic memory. They store records of past conversations, user preferences, and interaction history. These systems help agents maintain context across sessions and learn from experience. But they're external to the model's forward pass and don't optimize how the model internally processes static linguistic patterns.

For Chris Latimer, founder and CEO of Vectorize, which developed Hindsight, the conditional memory approach used in Engram solves a different problem than agentic AI memory. "It's not solving the problem of connecting agents to external memory like conversation histories and knowledge stores," Latimer told VentureBeat. "It's more geared towards squeezing performance out of smaller models and getting more mileage out of scarce GPU resources."

Conditional memory tackles a fundamental issue: Transformers lack a native knowledge lookup primitive. When processing text, they must simulate retrieval of static patterns through expensive neural computation across multiple layers. These patterns include named entities, technical terminology, and common phrases. The DeepSeek paper illustrates this with a concrete example. Recognizing "Diana, Princess of Wales" requires consuming multiple layers of attention and feed-forward networks to progressively compose features. The model essentially uses deep, dynamic logic circuits to perform what should be a simple hash table lookup. It's like using a calculator to remember your phone number rather than just looking it up. "The problem is that Transformer lacks a 'native knowledge lookup' ability," the researchers write. "Many tasks that should be solved in O(1) time like retrieval have to be 'simulated for retrieval' through a large amount of computation, which is very inefficient."

How conditional memory works

Engram introduces "conditional memory" to work alongside MoE's conditional computation. The mechanism is straightforward.
The module takes sequences of two to three tokens and uses hash functions to look them up in a massive embedding table. Retrieval happens in constant time, regardless of table size. But retrieved patterns need filtering. A hash lookup for "Apple" might collide with unrelated content, or the word might mean the fruit rather than the company. Engram solves this with a gating mechanism. The model's current understanding of context (accumulated through earlier attention layers) acts as a filter. If retrieved memory contradicts the current context, the gate suppresses it. If it fits, the gate lets it through. The module isn't applied at every layer. Strategic placement balances performance gains against system latency.

This dual-system design raises a critical question: How much capacity should each get? DeepSeek's key finding: the optimal split is 75-80% for computation and 20-25% for memory. Testing found pure MoE (100% computation) proved suboptimal. Too much computation wastes depth reconstructing static patterns; too much memory loses reasoning capacity.

Infrastructure efficiency: the GPU memory bypass

Perhaps Engram's most pragmatic contribution is its infrastructure-aware design. Unlike MoE's dynamic routing, which depends on runtime hidden states, Engram's retrieval indices depend solely on input token sequences. This deterministic nature enables a prefetch-and-overlap strategy.

"The challenge is that GPU memory is limited and expensive, so using bigger models gets costly and harder to deploy," Latimer said. "The clever idea behind Engram is to keep the main model on the GPU, but offload a big chunk of the model's stored information into a separate memory on regular RAM, which the model can use on a just-in-time basis."

During inference, the system can asynchronously retrieve embeddings from host CPU memory via PCIe. This happens while the GPU computes preceding transformer blocks. Strategic layer placement leverages the computation of early layers as a buffer to mask communication latency. The researchers demonstrated this with a 100B-parameter embedding table entirely offloaded to host DRAM. They achieved throughput penalties below 3%. This decoupling of storage from compute addresses a critical enterprise constraint, as GPU high-bandwidth memory remains expensive and scarce.

What this means for enterprise AI deployment

For enterprises evaluating AI infrastructure strategies, DeepSeek's findings suggest several actionable insights:

1. Hybrid architectures outperform pure approaches. The 75/25 allocation law indicates that optimal models should split sparse capacity between computation and memory.

2. Infrastructure costs may shift from GPU to memory. If Engram-style architectures prove viable in production, infrastructure investment patterns could change. The ability to store 100B+ parameters in CPU memory with minimal overhead suggests that memory-rich, compute-moderate configurations may offer better performance-per-dollar than pure GPU scaling.

3. Reasoning improvements exceed knowledge gains. The surprising finding that reasoning benefits more than knowledge retrieval suggests that memory's value extends beyond obvious use cases.

For enterprises leading AI adoption, Engram demonstrates that the next frontier may not be simply bigger models. It's smarter architectural choices that respect the fundamental distinction between static knowledge and dynamic reasoning. The research suggests that optimal AI systems will increasingly resemble hybrid architectures.
Organizations waiting to adopt AI later in the cycle should monitor whether major model providers incorporate conditional memory principles into their architectures. If the 75/25 allocation law holds across scales and domains, the next generation of foundation models may deliver substantially better reasoning performance at lower infrastructure costs.
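A brief PyTorch sketch may help make the prefetch-and-overlap strategy described above concrete. The stream handling, staging buffer, and layer split are assumptions for illustration; only the general idea (indices that depend solely on input tokens let the host-to-GPU copy overlap with earlier layers) follows the article's description.

```python
# Sketch of prefetch-and-overlap with a host-resident embedding table
# (illustrative assumptions throughout; not DeepSeek's released code).
import torch

EMBED_DIM, TABLE_ROWS, MAX_LOOKUPS = 1024, 1_000_000, 512

# Large embedding table kept in pinned host RAM instead of GPU HBM.
host_table = torch.randn(TABLE_ROWS, EMBED_DIM).pin_memory()
staging = torch.empty(MAX_LOOKUPS, EMBED_DIM).pin_memory()
copy_stream = torch.cuda.Stream()

def prefetch_rows(indices: torch.Tensor) -> torch.Tensor:
    """Kick off an async PCIe copy of the rows the memory lookup will need."""
    n = indices.numel()
    # Gather into a pinned staging buffer so the device copy can be async.
    torch.index_select(host_table, 0, indices, out=staging[:n])
    with torch.cuda.stream(copy_stream):
        return staging[:n].to("cuda", non_blocking=True)

def forward(tokens, memory_indices, early_layers, late_layers):
    # Indices depend only on the input tokens, so the copy can start before
    # any transformer block has run.
    memory_rows = prefetch_rows(memory_indices)

    # Early layers run on the default stream, overlapping the PCIe transfer.
    hidden = early_layers(tokens)

    # Block only at the point where the retrieved memory is consumed.
    torch.cuda.current_stream().wait_stream(copy_stream)
    return late_layers(hidden, memory_rows)
```

In DeepSeek's reported setup, this kind of overlap is said to keep throughput penalties below 3% even with a 100B-parameter table offloaded to host DRAM.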
[3]
Decoding DeepSeek's Solution to China's Compute Shortage | AIM
DeepSeek's new research enables retrieval through conditional memory rather than neural computation, freeing up GPUs.

Ahead of the highly anticipated launch of its v4 model, DeepSeek has published research that could fundamentally reshape how large language models handle knowledge, and potentially sidestep the hardware constraints hampering Chinese AI development. In a paper co-authored by DeepSeek CEO Liang Wenfeng, the research introduces "Engram", a method that allows language models to retrieve knowledge through direct lookup rather than wasteful computation. DeepSeek's work matters because Chinese AI labs are looking for algorithmic efficiency, as they are running out of room to scale using brute-force compute due to US export controls on GPUs. The paper explains that much of the GPU budget is spent reconstructing information that could be retrieved directly from memory or caches. As the authors put it, "Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation." While Mixture-of-Experts (MoE) models
DeepSeek released research on Engram, a conditional memory module that separates static information retrieval from dynamic reasoning. The breakthrough enables AI models to commit static knowledge to system RAM instead of relying on expensive GPU computation, addressing China's compute shortage while reducing infrastructure costs; tokenizer compression also shrinks the conditional memory module's vocabulary by 23%.
DeepSeek has released groundbreaking research detailing Engram, a conditional memory module that fundamentally changes how large language models handle knowledge retrieval [1][2]. Co-authored by DeepSeek CEO Liang Wenfeng, the paper addresses a critical inefficiency: enterprise LLMs waste expensive GPU cycles retrieving static information like product names and technical specifications millions of times daily [2]. The conditional memory approach allows models to commit static knowledge to system RAM, fundamentally decoupling compute power and RAM pools to bypass GPU constraints that have plagued AI infrastructure [1].
Source: Tom's Hardware
The research arrives as Chinese AI labs face mounting pressure from US export controls on GPUs, forcing them to seek algorithmic efficiency rather than brute-force compute scaling [3]. By separating static information retrieval from dynamic reasoning, Engram offers a practical path forward for organizations navigating China's compute shortage while simultaneously addressing broader infrastructure cost concerns [2][3].

Transformer models lack a native knowledge lookup primitive, forcing them to simulate retrieval of static patterns through expensive neural computation across multiple layers [2]. Recognizing "Diana, Princess of Wales" requires consuming multiple attention and feed-forward network layers to progressively compose features, essentially using deep, dynamic logic circuits for what should be a simple hash table lookup [2]. DeepSeek's paper explains this as an "expensive runtime reconstruction of a static lookup table, wasting valuable sequential depth on trivial operations that could otherwise be allocated to higher-level reasoning" [1].
Source: VentureBeat
Engram integrates N-grams, statistical sequences of words, into queryable memory banks, allowing models to remember facts rather than reason them out [1]. The module takes sequences of two to three tokens and uses hash functions to look them up in a massive embedding table, with retrieval happening in constant time regardless of table size [2]. This approach employs Multi-Head Hashing to distinguish between similar terms, ensuring "Universal" remains distinct from "Universal Studios", before passing through context-aware gating mechanisms that confirm terms match sentence context [1].

Through systematic experiments, DeepSeek found the optimal balance allocates 75-80% of sparse model capacity to dynamic reasoning and 20-25% to static lookups [2]. An Engram-based model scaled to nearly 27 billion parameters outperformed standard Mixture of Experts (MoE) models in long-context query performance while eliminating computational waste [1]. Complex reasoning benchmarks jumped from 70% to 74% accuracy, while knowledge-focused tests improved from 57% to 61% across Big-Bench Hard, ARC-Challenge, and MMLU evaluations [2]. Tokenizer compression reduced vocabulary size for the conditional memory module by 23%, compressing equivalent tokens with different capitalizations as the same canonical concept [1]. Standard MoE models must reconstruct data every time it's referenced through conditional computation, calling on expert parameters repeatedly even for identical queries [1]. Engram simply asks, "Do I already have this data?", before accessing reasoning-focused model components [1].

Engram's most practical contribution lies in its infrastructure-aware design, which enables a GPU memory bypass strategy [2]. Unlike Nvidia's KVCache announced at CES 2026, which offloads context data to NVMe memory with BlueField-4 as a short-term solution for recent conversations, Engram maintains persistent pre-calculated data [1]. As one analogy puts it: KVCache stores handwritten notes, while Engram records the whole encyclopedia [1]. Chris Latimer, founder and CEO of Vectorize, notes that conditional memory solves a different problem than Retrieval-Augmented Generation (RAG) or agentic memory systems like Hindsight: "It's more geared towards squeezing performance out of smaller models and getting more mileage out of scarce GPU resources" [2]. The deterministic nature of Engram's retrieval, dependent solely on input token sequences rather than runtime hidden states, enables prefetch-and-overlap strategies that standard MoE dynamic routing cannot achieve [2].

By committing knowledge libraries to common system DRAM standards like CXL, Engram addresses the ongoing reliance on high-bandwidth memory (HBM) that constrains AI infrastructure even for Chinese silicon like Huawei's Ascend series [1]. This separation allows GPU HBM to dedicate itself to reasoning while static memory operates independently, enabling more performant models as organizations face memory supply squeezes and mounting infrastructure costs [2].