Tensordyne Napier AI Chip vs Nvidia Blackwell

Tensordyne Takes Aim at Nvidia with Logarithmic Computing

AI chip startup Tensordyne has announced the successful tape-out of its Napier AI chip, a milestone in its bid to outperform Nvidia's Blackwell in AI inference workloads [1]. Built on TSMC's 3nm process in collaboration with Broadcom and HPE's Juniper Networks, the chip is now in production, with beta deployment scheduled for Q1 2027 and commercial shipments expected by the end of Q2 2027 [2]. The company says it has already secured over $200 million in projected system demand, a sign of early interest in its unconventional approach to AI acceleration [3].

Source: The Register

What sets Tensordyne apart is its use of logarithmic matrix multiplication, which transforms multiplications into simpler additions. The technique relies on the identity that the logarithm of A times B equals the logarithm of A plus the logarithm of B. "We've turned multipliers into adders," explains Gilles Backhus, Tensordyne founder and vice president of AI [1]. Since adder circuits are smaller and more power-efficient than multiplier circuits, this approach allows Napier to pack significantly more compute into a smaller silicon area while consuming less power.
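The identity behind this approach can be illustrated in a few lines of Python. This is a minimal sketch of the math only, not Tensordyne's hardware design; the function name `log_domain_multiply` is made up for illustration, and the sketch handles positive operands only (sign handling is left out).

```python
import math

def log_domain_multiply(a: float, b: float) -> float:
    """Multiply two positive numbers by adding their base-2 logarithms."""
    if a <= 0 or b <= 0:
        raise ValueError("this sketch handles positive operands only")
    log_sum = math.log2(a) + math.log2(b)  # the multiplication becomes an addition
    return 2.0 ** log_sum                  # convert back to the linear domain

product = log_domain_multiply(6.0, 7.0)
print(f"{product:.6f}")  # agrees with 6 * 7 up to floating-point rounding
```

In software the two logarithms and the final exponentiation cost far more than a multiply, which is why the conversion steps, rather than the addition, decide whether the idea pays off.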

Breaking Through the Conversion Barrier

The concept of using logarithms for computation isn't new, but previous attempts failed because converting between logarithmic numbers and the floating-point numbers used in neural networks introduced too much latency, energy overhead, and accuracy loss. Tensordyne claims to have solved this problem. "So far no one has figured out how to do the linear to logarithm and logarithm to linear conversion as we have," Backhus states [1]. The company uses the Mitchell approximation as a heuristic to estimate log and antilog values, combined with a section-wise correction mechanism implemented in hardware that it says delivers accuracy equivalent to FP16 [2]. This avoids the need for impractically large look-up tables while maintaining the precision required for large language models.
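For readers unfamiliar with the Mitchell approximation, the sketch below shows the textbook version: writing x as 2^k times (1 + f) with f between 0 and 1, it estimates log2(x) as k + f, and inverts the same rule for the antilog. Tensordyne's section-wise correction is not described in the sources, so it is deliberately not modeled here; the function names are illustrative only.

```python
import math

def mitchell_log2(x: float) -> float:
    """Approximate log2(x) for x > 0 as k + f, where x = 2**k * (1 + f)."""
    if x <= 0:
        raise ValueError("positive input only")
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, mantissa in [0.5, 1)
    k = exponent - 1                    # rescale so that x = 2**k * (1 + f)
    f = 2.0 * mantissa - 1.0            # f lies in [0, 1)
    return k + f                        # Mitchell: log2(1 + f) is taken to be f

def mitchell_antilog2(y: float) -> float:
    """Approximate 2**y as 2**k * (1 + f), with k = floor(y) and f = y - k."""
    k = math.floor(y)
    f = y - k
    return math.ldexp(1.0 + f, k)       # (1 + f) * 2**k

def mitchell_multiply(a: float, b: float) -> float:
    """Approximate a * b with one addition between two cheap conversions."""
    return mitchell_antilog2(mitchell_log2(a) + mitchell_log2(b))

for a, b in [(4.0, 8.0), (3.0, 5.0), (1.5, 1.5)]:
    approx = mitchell_multiply(a, b)
    error = abs(approx - a * b) / (a * b)
    print(f"{a} * {b}: approx={approx}, relative error={error:.3f}")
```

The uncorrected approximation is exact for powers of two but underestimates other products by up to roughly 11 percent, which shows why some form of correction is needed before the technique is usable for neural networks.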

The Napier AI chip itself packs 138 billion transistors and delivers 2.1 petaflops of dense FP8 compute [3]. Each chip features 144GB of HBM3E memory spread across four stacks, providing 4.7TB/s of memory bandwidth, along with 256MB of SRAM, five times more on-chip SRAM than Nvidia's Blackwell [4]. Operating at a 300-watt TDP, Napier uses nearly 60 percent less power than Nvidia's H200 accelerators while delivering comparable specifications [2]. The chip also supports FP8 and 4-bit block floating-point datatypes, providing flexibility for different model requirements.
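The "nearly 60 percent" power figure can be sanity-checked with simple arithmetic. The article gives only Napier's TDP; the 700-watt value used for the H200 below is an assumption based on Nvidia's published maximum for the SXM version of that part.

```python
NAPIER_TDP_W = 300  # from the article
H200_TDP_W = 700    # assumption: published maximum TDP of the H200 SXM part

def percent_less(value: float, baseline: float) -> float:
    """Return how much smaller value is than baseline, in percent."""
    return 100.0 * (1.0 - value / baseline)

saving = percent_less(NAPIER_TDP_W, H200_TDP_W)
print(f"Napier draws {saving:.1f}% less power than the assumed H200 TDP")
```

Under that assumption the saving works out to about 57 percent, consistent with the article's rounding.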

The TDN72 System Architecture

Source: IEEE

Tensordyne's full rack configuration houses 288 Napier chips across four TDN72 systems, or pods, of 72 chips each [3]. Each pod consists of eight air-cooled compute blades, with each blade containing a single 10-core Intel Xeon-D host CPU and nine Napier accelerators [2]. The complete rack delivers 608 petaflops of FP8 compute and 42TB of HBM3E memory within a 120-kilowatt power envelope, all without requiring liquid cooling [3].
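The rack-level totals follow from the per-chip figures quoted earlier. The short calculation below is a sketch that multiplies them out, using only numbers from the article and decimal units (1TB = 1,000GB).

```python
CHIPS_PER_BLADE = 9
BLADES_PER_POD = 8
PODS_PER_RACK = 4
PFLOPS_PER_CHIP = 2.1   # dense FP8, per chip
HBM_GB_PER_CHIP = 144
CHIP_TDP_W = 300

chips_per_pod = CHIPS_PER_BLADE * BLADES_PER_POD            # 72 chips
chips_per_rack = chips_per_pod * PODS_PER_RACK              # 288 chips
rack_pflops = chips_per_rack * PFLOPS_PER_CHIP              # about 604.8 petaflops
rack_hbm_tb = chips_per_rack * HBM_GB_PER_CHIP / 1000       # about 41.5 TB
accelerator_power_kw = chips_per_rack * CHIP_TDP_W / 1000   # 86.4 kW for accelerators alone

print(chips_per_rack, round(rack_pflops, 1), round(rack_hbm_tb, 1), accelerator_power_kw)
```

The computed 604.8 petaflops is slightly below the quoted 608, presumably because the per-chip figure is rounded. The accelerators account for about 86kW of the 120kW envelope, leaving the remainder for host CPUs, switches, and other components.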

The chips connect through a proprietary interconnect called TDN Link, which delivers sub-microsecond chip-to-chip latency with 1TB/s of bandwidth across the 72-chip system [3]. Each chip connects to six proprietary fabric switch blades developed by Juniper Networks in an all-to-all fabric topology reminiscent of Nvidia's GB200 NVL72 rack systems [2]. Despite the similarities, the TDN72 is significantly more compact: up to four 30kW systems can fit into a 52U rack, making deployment in older brownfield datacenters more feasible.

Targeting Both Prefill and Decode

As AI inference workloads become more critical than training, particularly with the rise of AI agents, companies are optimizing for the two main stages of running large language models: prefill and decode. Prefill is compute-intensive: it processes the tokenized input prompt and builds the key-value cache. Decode generates output tokens one at a time, making it more dependent on memory and network latency than raw compute power [1]. While Nvidia is touting a split architecture, with B300 GPUs for prefill and Groq 3 processors for decode, Tensordyne claims its system can handle both jobs efficiently. "We're optimizing for two hard challenges here at the same time," says R.K. Anand, chief product officer and co-founder of Tensordyne. "We're the first company proving that you can do both without going to multiple vendors and multiple racks" [1].
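The difference between the two stages is easier to see in a toy example. In the sketch below, `toy_kv` and `toy_next_token` are hypothetical stand-ins for a real model's key/value projections and its attention-plus-sampling step; only the shape of the control flow is meant to be representative.

```python
from typing import List, Tuple

KVCache = List[Tuple[int, int]]

def toy_kv(token: int) -> Tuple[int, int]:
    """Hypothetical stand-in for a model's key/value projections."""
    return (token * 2, token * 3)

def toy_next_token(cache: KVCache) -> int:
    """Hypothetical stand-in for attention and sampling; reads the whole cache."""
    return sum(k + v for k, v in cache) % 100

def prefill(prompt: List[int]) -> KVCache:
    # Every prompt token is known up front, so all cache entries can be
    # computed in one batch. On real hardware this stage is compute-bound.
    return [toy_kv(t) for t in prompt]

def decode(cache: KVCache, steps: int) -> List[int]:
    generated = []
    for _ in range(steps):
        token = toy_next_token(cache)  # each step re-reads every cached entry
        cache.append(toy_kv(token))    # and grows the cache by one entry
        generated.append(token)
    return generated

cache = prefill([1, 2, 3])
generated = decode(cache, steps=4)
print(len(cache), generated)
```

Prefill does a lot of independent work at once, while decode is a strictly sequential loop whose cost per step is dominated by reading the cache and weights, which is why the two stages stress hardware differently.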

For a 2-trillion-parameter model, Tensordyne claims a 4-pod rack would deliver 1,300 tokens per second per user at a cost of $11 per million tokens while consuming 120 kilowatts of power [1]. The company asserts the TDN72 provides 13x more tokens per second and 17x more tokens per watt than Nvidia's Blackwell NVL72 [4]. For multi-trillion-parameter models, Tensordyne claims a single rack can match the throughput of nine Nvidia Vera Rubin plus Groq LPX racks, delivering 1,000 tokens per second per user [4]. Tensordyne says these performance claims translate to up to $33 million more annual revenue per rack compared to competing solutions [4].
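The 13x throughput and 17x efficiency figures together imply a power draw for the system being compared against. The sketch below works that out under two assumptions that the sources do not confirm: that both ratios refer to the same workload and configuration, and that the Tensordyne side of the comparison draws the full 120 kilowatts.

```python
def implied_baseline_power_kw(system_power_kw: float,
                              throughput_ratio: float,
                              efficiency_ratio: float) -> float:
    """Solve for the baseline's power from the two claimed ratios.

    throughput_ratio = T_sys / T_base
    efficiency_ratio = (T_sys / P_sys) / (T_base / P_base)
    Dividing the second by the first gives P_base / P_sys.
    """
    return system_power_kw * efficiency_ratio / throughput_ratio

baseline_kw = implied_baseline_power_kw(120.0, throughput_ratio=13.0, efficiency_ratio=17.0)
print(f"Implied baseline power: {baseline_kw:.1f} kW")
```

Under those assumptions the comparison system would draw roughly 157 kilowatts, about 1.3 times the Tensordyne rack.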

Software Maturity and Market Timing

Tensordyne has worked to simplify software deployment since building its first prototype silicon. Early prototypes lacked error correction and required quantization-aware training, making them impractical for trillion-parameter models [2]. The current platform promises compatibility with Hugging Face-hosted models, PyTorch, and Triton, along with a custom Python SDK that can convert existing models to run directly on the hardware [3].

However, the standard caveats apply. These performance figures come from simulations, and real systems won't be available to verify the claims until 2027 [1]. By the time Tensordyne ships, it will compete against Nvidia's next-generation Vera Rubin and Vera Rubin Ultra systems, which present a stiffer challenge, especially regarding software compatibility [2]. The 3nm AI accelerator landscape will also feature competition from AMD and a growing field of inference-focused silicon startups [3]. For organizations deploying AI infrastructure, power efficiency and cost per token will determine whether Tensordyne's logarithmic approach can disrupt Nvidia's dominance in a market where software ecosystems often matter as much as raw performance.


References

[1] Tensordyne's Wild Log Math Aims to Leave Nvidia's AI Chips In the Dust

[2] Tensordyne makes a big bet on log math to beat Nvidia

[3] US AI startup Tensordyne claims 3nm Napier chip outperforms NVIDIA Blackwell by 13x in tokens per second

[4] Tensordyne's 3nm Napier AI Chip Promises 13x Higher Token Throughput Than Blackwell & Blazes Past Rubin With 1000 Tokens/s In Multi-Trillion Parameter Models
