DeepSeek unveils Engram conditional memory to bypass GPU constraints and cut wasted cycles

Reviewed by Nidhi Govil


DeepSeek released research on Engram, a conditional memory module that separates static information retrieval from dynamic reasoning. The approach lets AI models commit static knowledge to system RAM instead of relying on expensive GPU computation, addressing China's compute shortage and cutting infrastructure costs, while tokenizer compression shrinks the memory module's vocabulary by 23%.

DeepSeek Introduces Engram to Address AI Memory Inefficiency

DeepSeek has released groundbreaking research detailing Engram, a conditional memory module that fundamentally changes how large language models handle knowledge retrieval [1][2]. Co-authored by DeepSeek CEO Liang Wenfeng, the paper addresses a critical inefficiency: enterprise LLMs waste expensive GPU cycles retrieving static information such as product names and technical specifications millions of times daily [2]. The conditional memory approach allows models to commit static knowledge to system RAM, decoupling the compute and RAM pools to bypass the GPU constraints that have plagued AI infrastructure [1].

Source: Tom's Hardware

The research arrives as Chinese AI labs face mounting pressure from US export controls on GPUs, forcing them to seek algorithmic efficiency rather than brute-force compute scaling [3]. By separating static information retrieval from dynamic reasoning, Engram offers a practical path forward for organizations navigating China's compute shortage while simultaneously addressing broader infrastructure cost concerns [2][3].

How Engram Reduces Wasted GPU Cycles Through Static Lookups

Transformer models lack a native knowledge-lookup primitive, forcing them to simulate retrieval of static patterns through expensive neural computation across multiple layers [2]. Recognizing "Diana, Princess of Wales" consumes multiple attention and feed-forward network layers that progressively compose features, effectively using deep, dynamic logic circuits for what should be a simple hash-table lookup [2]. DeepSeek's paper describes this as an "expensive runtime reconstruction of a static lookup table, wasting valuable sequential depth on trivial operations that could otherwise be allocated to higher-level reasoning" [1].

Source: VentureBeat

Engram integrates N-grams, statistical sequences of words, into queryable memory banks, allowing models to remember facts rather than reason them out [1]. The module takes sequences of two to three tokens and uses hash functions to look them up in a massive embedding table, with retrieval happening in constant time regardless of table size [2]. Multi-Head Hashing keeps similar terms distinct, ensuring that "Universal" is not conflated with "Universal Studios", before a context-aware gating mechanism confirms the retrieved term actually matches the sentence's context [1].
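To make the mechanism concrete, here is a minimal Python sketch of this kind of n-gram memory, with hypothetical table sizes, hash constants, and function names rather than DeepSeek's actual implementation: each n-gram is hashed by several independent heads into a large embedding table, retrieval is a constant-time row read, and a context-aware gate scales the result.

```python
import numpy as np

# Minimal, hypothetical sketch of an Engram-style n-gram lookup (not DeepSeek's code).
TABLE_SIZE = 2 ** 16   # rows per head; a production table would be far larger
NUM_HEADS = 4          # multi-head hashing reduces collisions between similar terms
DIM = 64               # embedding width per head

rng = np.random.default_rng(0)
table = rng.standard_normal((NUM_HEADS, TABLE_SIZE, DIM)).astype(np.float32)

def ngram_hash(tokens, head):
    # Salted FNV-style hash: each head uses a different salt, so a collision in
    # one head ("Universal" vs "Universal Studios") rarely repeats in another.
    h = 1469598103934665603 ^ head
    for t in tokens:
        h = ((h * 1099511628211) ^ t) & ((1 << 64) - 1)
    return h % TABLE_SIZE

def engram_lookup(tokens, hidden_state):
    # Constant-time retrieval: one row per head, independent of table size.
    mem = np.concatenate([table[h, ngram_hash(tokens, h)] for h in range(NUM_HEADS)])
    # Context-aware gate: a scalar in (0, 1) derived from the current hidden
    # state decides whether the memory actually fits the surrounding sentence.
    gate = 1.0 / (1.0 + np.exp(-float(hidden_state @ mem[:DIM])))
    return gate * mem

# Example: look up a two-token n-gram given some hidden state.
hidden = rng.standard_normal(DIM).astype(np.float32)
print(engram_lookup((101, 2047), hidden).shape)   # (NUM_HEADS * DIM,) -> (256,)
```

In this toy version the gate plays the role of the context check described above; everything else is an ordinary table read.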

Optimal Balance Between Memory and Computation Drives Performance

Through systematic experiments, DeepSeek found the optimal balance allocates 75-80% of sparse model capacity to dynamic reasoning and 20-25% to static lookups [2]. An Engram-based model scaled to nearly 27 billion parameters outperformed standard Mixture of Experts (MoE) models in long-context query performance while eliminating computational waste [1]. Complex reasoning benchmarks jumped from 70% to 74% accuracy, while knowledge-focused tests improved from 57% to 61% across Big-Bench Hard, ARC-Challenge, and MMLU evaluations.

Tokenizer compression reduced the conditional memory module's vocabulary size by 23% by folding tokens that differ only in capitalization into the same canonical concept [1]. Standard MoE models must reconstruct data through conditional computation every time it is referenced, calling on expert parameters repeatedly even for identical queries [1]. Engram simply asks, "Do I already have this data?" before engaging the reasoning-focused components of the model [1].
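As a rough illustration of that capitalization folding (the tokenizer, vocabulary, and mapping below are invented for the example; the paper's exact canonicalization rules are not described here):

```python
# Hypothetical illustration of capitalization folding for the memory module's
# vocabulary: variants differing only in case resolve to one canonical concept id.

def canonical_id(token, canon_vocab):
    return canon_vocab[token.lower()]

canon_vocab = {"universal": 0, "studios": 1, "diana": 2}

# "Universal", "UNIVERSAL" and "universal" all hit the same memory-table entry,
# which is how folding shrinks the module's effective vocabulary.
assert canonical_id("Universal", canon_vocab) == canonical_id("UNIVERSAL", canon_vocab)
```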

GPU Memory Bypass Distinguishes Engram from RAG and KVCache

Engram's most practical contribution lies in its infrastructure-aware design, which enables a GPU memory bypass strategy [2]. Unlike the KVCache offloading Nvidia announced at CES 2026, which pushes context data out to NVMe storage via BlueField-4 as a short-term store for recent conversations, Engram maintains persistent pre-calculated data [1]. As the paper puts it, KVCache stores handwritten notes while Engram records the whole encyclopedia [1].

Chris Latimer, founder and CEO of Vectorize, notes that conditional memory solves a different problem than Retrieval-Augmented Generation (RAG) or agentic memory systems like Hindsight: "It's more geared towards squeezing performance out of smaller models and getting more mileage out of scarce GPU resources" [2]. The deterministic nature of Engram's retrieval, which depends solely on input token sequences rather than runtime hidden states, enables prefetch-and-overlap strategies that standard MoE dynamic routing cannot achieve [2].
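A CPU-only sketch of why that determinism matters for scheduling (the table, shapes, and function names here are invented for illustration): because the rows to fetch are known from the prompt alone, the gather from system RAM can run concurrently with earlier-layer compute instead of waiting on a data-dependent routing decision.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np
import time

# Illustrative prefetch-and-overlap stand-in (hypothetical names and shapes).
host_table = np.random.default_rng(1).standard_normal((65536, 256)).astype(np.float32)

def gather_rows(row_ids):
    return host_table[row_ids]          # pure host-side memory read from system RAM

def run_lower_layers(x):
    time.sleep(0.01)                    # stand-in for accelerator work on earlier layers
    return x * 2.0

row_ids = np.array([17, 4242, 31337])   # computable from the input tokens before the forward pass
x = np.ones(256, dtype=np.float32)

with ThreadPoolExecutor(max_workers=1) as pool:
    prefetch = pool.submit(gather_rows, row_ids)   # starts immediately
    hidden = run_lower_layers(x)                   # overlaps with the gather
    mem_rows = prefetch.result()                   # ready by the time the memory layer needs it

print(hidden.shape, mem_rows.shape)                # (256,) (3, 256)
```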

By committing knowledge libraries to common system DRAM, reachable through standards like CXL, Engram addresses the ongoing reliance on High-Bandwidth Memory (HBM) that constrains AI infrastructure even for Chinese silicon like Huawei's Ascend series [1]. This separation lets GPU HBM dedicate itself to reasoning while static memory operates independently, enabling more performant models as organizations face memory supply squeezes and mounting infrastructure costs [2].
