2 Sources
[1]
Tensordyne's Wild Log Math Aims to Leave Nvidia's AI Chips In the Dust
Tensordyne's Napier pods fit 72 of its new AI chips in a system that takes up one-quarter of a server rack.

If simulations are to be believed, startup Tensordyne's new AI chip could crush the performance of market leader Nvidia in terms of energy efficiency and latency for inferencing. The company just sent the plans for its first chip to be manufactured, with commercial sales of a 72-chip system scheduled for the second half of 2027. Tensordyne claims its 72-chip system can run large LLMs four times as fast using one-fifth the power compared to a 72-Nvidia GB300 system. However, real systems won't be around to back these figures up until the end of the year.

The not-so-secret sauce behind the outsized efficiency of Tensordyne's new chip, Napier, is how it does matrix multiplication, the main math of AI. It takes advantage of the fact that the logarithm of A times B equals the logarithm of A plus the logarithm of B. "We've turned multipliers into adders," explains Gilles Backhus, a Tensordyne founder and vice president of AI. Adders are smaller and more energy-efficient logic circuits than those that do multiplication, he says. So Napier can pack more compute into a smaller area and still save on power.

New kinds of numbers

That such a thing was possible has long been known, but there wasn't a good way to use it, because converting back and forth between logarithmic numbers and the floating point numbers that describe neural networks took too much time and energy and introduced too many inaccuracies. Not anymore, according to Backhus. "So far no one has figured out how to do the linear to logarithm and logarithm to linear conversion as we have," he says. "And that's actually the crux of that whole thing. Our engineers have figured out ways to do this very elegantly and very very accurately and cheaply on silicon."

The importance of number formats hasn't been lost on the AI industry.
Speaking at IEEE Hot Chips in 2023, Nvidia chief scientist Bill Dally attributed the majority of the improvement in the company's GPUs at the time to the use of shorter number formats and the smaller circuits they require. Researchers have also worked on circuits to compute with alternative formats, such as the logarithm-like posit and more recently its scientific-computing counterpart the takum. However, these formats have not reached mainstream adoption, mostly because their hardware implementation is so different from traditional floating point.

Inference Demands Influence Architecture

Market trends, including the rise of AI agents, mean inference, the execution of neural network models, is becoming more important than training new large language models. Factors like the cost and the speed at which answers are delivered are starting to dominate, and that's led AI companies to look for system architectures that are a better fit for that. Tensordyne executives say they saw this coming and engineered their computers to meet it.

There are two main parts to executing an LLM: prefill and decode. In the prefill stage the model takes in the input text and turns it into tokens, the basic units it can work with, and builds a kind of working memory about the input, called the key-value cache. It's a computationally heavy task.

Decode is where the LLM generates its output tokens, the answer or response to your input. Each new token is predicted using the previous token and the key-value cache. This sequential nature can make decode a slower process, and it's more dependent on memory and network latency than computing power.

So, AI chip makers are starting to build systems with those two different demands in mind. Nvidia is touting a system where a server rack full of B300 GPUs handles prefill and several racks of its Groq 3 processors do the decode.
Amazon Web Services is combining a rack of its Trainium AI chips for prefill with several racks of Cerebras's wafer-scale computers for decode.

Tensordyne says its system can handle both jobs. "We're optimizing for two hard challenges here at the same time," says R.K. Anand, chief product officer and co-founder of Tensordyne. "We're the first company proving that you can do both without going to multiple vendors and multiple racks." The dense compute needed for prefill comes from the logarithmic math. The needs of decode come from 144 gigabytes of high-bandwidth memory and a custom 1-microsecond-latency network called Tensordyne Napier Link.

In a "pod" system that fits in one quarter of a standard rack, Tensordyne packs in 72 Napier chips, 8 Intel Xeon CPUs, and 64 terabytes of solid-state storage. A 4-pod rack working on a 2-trillion-parameter LLM would deliver 1300 tokens per second per user at a cost of $11 for 1 million tokens while consuming 120 kilowatts of power, the company claims, with one pod crunching out prefill and 3 working on decode. To get similar tokens per second per user numbers, a 9-rack Rubin and Groq 3 system would likely consume 1.5 megawatts, according to Tensordyne.

Whether or not these numbers really hold up will have to wait until later in the year. Tensordyne plans to have a beta version available through the cloud for customers to work with. It expects to begin shipping systems to customers about a year from now.
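To put the quoted rack-level figures side by side, here is a minimal Python sketch of the arithmetic. It only restates Tensordyne's own simulation-based claims from the article (120 kilowatts, 1.5 megawatts, 1300 tokens per second per user, $11 per million tokens); the constant and function names are made up for illustration, and nothing here is a measurement.

```python
# Figures as quoted in the article; all are Tensordyne's simulation-based claims.
NAPIER_RACK_KW = 120.0          # claimed draw of one 4-pod Napier rack
NVIDIA_NINE_RACK_KW = 1500.0    # 1.5 megawatts, Tensordyne's estimate for 9 Rubin + Groq 3 racks
TOKENS_PER_SEC_PER_USER = 1300.0
USD_PER_MILLION_TOKENS = 11.0


def power_ratio() -> float:
    """How many times more power the nine-rack system is claimed to draw."""
    return NVIDIA_NINE_RACK_KW / NAPIER_RACK_KW


def seconds_to_generate(tokens: int) -> float:
    """Time for one user to receive `tokens` output tokens at the claimed rate."""
    return tokens / TOKENS_PER_SEC_PER_USER


def cost_usd(tokens: int) -> float:
    """Cost of `tokens` tokens at the claimed price per million tokens."""
    return tokens / 1_000_000 * USD_PER_MILLION_TOKENS


if __name__ == "__main__":
    print(f"claimed power ratio: {power_ratio():.1f}x")
    print(f"2,000-token answer: {seconds_to_generate(2_000):.2f} s, ${cost_usd(2_000):.3f}")
```

Taken at face value, the claims work out to a 12.5x power gap at similar per-user speed, and a 2,000-token answer would arrive in about 1.5 seconds at a cost of roughly 2.2 cents. The comparison is only for similar tokens per second per user, so it says nothing about how many users each system could serve at once.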
[2]
Tensordyne's 3nm Napier AI Chip Promises 13x Higher Token Throughput Than Blackwell & Blazes Past Rubin With 1000 Tokens/s In Multi-Trillion Parameter Models
US-based AI company Tensordyne has announced the successful tape-out of its Napier chip, which it claims will demolish NVIDIA's Blackwell & Rubin chips with leading token throughput and efficiency.

Tensordyne's new Napier AI Chip arrives with one clear mission: to make NVIDIA's Blackwell and Rubin chips look considerably less impressive

The Napier chip will be the core component of the Tensordyne Napier TDN system, which is designed in collaboration with Broadcom and HPE Juniper Networks. The Napier platform has one goal: to unify AI through novel logarithmic AI math, a tightly integrated memory architecture, and a high-performance scale-up interconnect that drives higher token throughput at low power. Napier is built on TSMC's 3nm process, and with its successful tape-out, the chip is now in production.

With this primary milestone achieved, Tensordyne is now working towards beta deployment and a broader infrastructure plan that represents over $200 million in forecasted Napier system demand. And the key area of focus is AI inferencing. We just talked about how current AI infrastructure is constrained by power consumption, but solutions that tackle these constraints, such as 800V DC, are going to incur a huge deployment cost. Infrastructure such as power and cooling alone makes up 50% of the cost of major AI deployments, and to address this, Tensordyne has come up with a new inference stack across math, compute, memory, and networking:

* TDN Math (Logarithmic Mathematics): TDN replaces large-scale multiplication operations with simplified addition-based computation, significantly improving performance-per-watt efficiency across frontier AI models.
* TDN AIP (Artificial Intelligence Processor): Each TDN processor tightly integrates substantial fast SRAM alongside HBM memory, minimizing idle compute cycles and supporting efficient execution of the industry's largest models.
* TDN Link (Any-to-Any Scale-Up Interconnect): Tensordyne's proprietary scale-up fabric delivers sub-microsecond communication latency between processors, maximizing compute utilization and minimizing interconnect bottlenecks.

All of this is brought together in Tensordyne's TDN72 Inference Pod and Rack system. Each Pod is fitted with 72 Napier AI chips, which Tensordyne compares to NVIDIA's NVL72 rack with 72 Blackwell or Rubin GPUs. It requires far less infrastructure capacity, and a Napier Rack combines four TDN72 pods to deliver:

* 17x more tokens per watt (vs NVIDIA Blackwell)
* 13x more tokens per second (vs NVIDIA Blackwell)
* Up to $33 million more annual revenue per rack

Tensordyne doesn't stop at the Blackwell comparison; it also compares the Napier solution against NVIDIA's upcoming Rubin platform. The company claims that its platform supports multi-trillion parameter models with a throughput of 1000 tokens/s per user in a single-rack configuration. To do the same, NVIDIA will require nine Rubin + Groq LPX racks.

Tensordyne's Napier platform represents a bold leap forward in AI inference. By delivering 17× more tokens per watt and 13× higher throughput than NVIDIA Blackwell, while matching the performance of nine Rubin-based racks in a single compact footprint, it shatters the traditional speed-versus-cost and power-versus-performance trade-offs. With dramatically lower infrastructure demands, up to $33 million more annual revenue per rack, and efficient scaling for multi-trillion parameter models, Napier doesn't just compete with NVIDIA's Blackwell & Rubin; it redefines what's possible for next-generation AI deployment.
AI chip startup Tensordyne has taped out its Napier processor, claiming it delivers 17x more tokens per watt and 13x higher throughput than Nvidia Blackwell. The chip uses logarithmic mathematics to replace energy-intensive multiplication with simple addition. Napier is built on TSMC's 3nm process, the system is designed in collaboration with Broadcom and HPE Juniper Networks, and commercial sales are scheduled for the second half of 2027.
AI chip startup Tensordyne has completed the tape-out of its Napier AI chip, marking a significant milestone for a company attempting to disrupt Nvidia's dominance in AI hardware [1]. The chip, built on TSMC's 3nm process technology, is now in production, with commercial sales of a 72-chip system scheduled for the second half of 2027 [1][2]. Tensordyne claims its technology delivers 17x more tokens per watt and 13x higher token throughput than Nvidia Blackwell, representing a potential shift in how AI inferencing systems are designed and deployed [2].
Source: Wccftech
The core innovation behind the Napier AI chip lies in its approach to matrix multiplication, the fundamental mathematical operation powering large language models. Tensordyne exploits the identity that the logarithm of A times B equals the logarithm of A plus the logarithm of B. "We've turned multipliers into adders," explains Gilles Backhus, a Tensordyne founder and vice president of AI [1]. Because adders are smaller and more energy-efficient circuits than multipliers, Napier can pack more compute into a smaller area while consuming less power [1]. The concept has long been known, but it was not practical because converting between logarithmic numbers and floating point introduced too many inaccuracies and took too much time and energy. Tensordyne claims its engineers have solved this conversion challenge "very elegantly and very very accurately and cheaply on silicon" [1].

The Napier platform addresses a shift in AI deployment economics as inference workloads become more important than training. Market trends, including the rise of AI agents, mean the cost and speed at which answers are delivered are starting to dominate [1]. Tensordyne designed its system to handle both the prefill and decode stages of large language model execution. Prefill turns input text into tokens and builds the key-value cache, a computationally heavy task. Decode generates output tokens sequentially, making it more dependent on memory and network latency than on raw computing power [1]. Other vendors are splitting these tasks across separate systems, such as Nvidia's pairing of B300 GPUs for prefill with Groq 3 processors for decode, whereas Tensordyne claims its platform handles both without requiring multiple vendors and multiple racks [1].
Source: IEEE
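The "multipliers into adders" idea can be illustrated with a minimal Python sketch of log-domain multiplication. This is not Tensordyne's number format or conversion circuit, neither of which is described in the sources; it uses ordinary floating-point `math.log2` purely to show the identity log(A·B) = log A + log B at work. The function names (`to_log`, `log_mul`, `from_log`, `log_dot`) are hypothetical.

```python
import math


def to_log(x: float) -> tuple[int, float]:
    """Encode a nonzero real number as (sign, log2 of its magnitude)."""
    if x == 0.0:
        # Zero has no finite logarithm; a real design would need a special case.
        raise ValueError("cannot log-encode zero in this sketch")
    return (1 if x > 0 else -1, math.log2(abs(x)))


def log_mul(a: tuple[int, float], b: tuple[int, float]) -> tuple[int, float]:
    """Multiply two log-encoded values: signs multiply, logarithms add."""
    return (a[0] * b[0], a[1] + b[1])


def from_log(v: tuple[int, float]) -> float:
    """Decode (sign, log2 magnitude) back to an ordinary number."""
    sign, log_magnitude = v
    return sign * 2.0 ** log_magnitude


def log_dot(xs: list[float], ys: list[float]) -> float:
    """Dot product in which every multiply is replaced by a log-domain add.

    The products are converted back to linear form before being summed,
    because addition has no equally cheap counterpart in the log domain.
    """
    return sum(from_log(log_mul(to_log(x), to_log(y))) for x, y in zip(xs, ys))


if __name__ == "__main__":
    print(from_log(log_mul(to_log(3.0), to_log(-4.0))))
    print(log_dot([1.5, 2.0, -3.0], [4.0, 0.5, 2.0]))
```

The sketch also makes the cost visible: every product has to pass through `to_log` and `from_log`, which is the linear-to-logarithm and logarithm-to-linear conversion step that Backhus describes as the crux of the design.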
The Tensordyne Napier TDN system, designed in collaboration with Broadcom and HPE Juniper Networks, combines three key technologies: logarithmic math, a processor with tightly integrated memory, and a scale-up interconnect [2]. Each Napier chip integrates substantial fast SRAM alongside 144 gigabytes of high-bandwidth memory to minimize idle compute cycles [1][2]. A custom low-latency network called Tensordyne Napier Link connects the processors with a latency of about 1 microsecond [1], described as sub-microsecond in [2], to maximize compute utilization. A single pod, which fits in one quarter of a standard rack, packs 72 Napier chips, 8 Intel Xeon CPUs, and 64 terabytes of solid-state storage [1].

Tensordyne's most aggressive claims target both current and future Nvidia platforms. The company states that a 4-pod rack working on a 2-trillion-parameter LLM would deliver 1300 tokens per second per user at a cost of $11 per 1 million tokens while consuming 120 kilowatts of power [1]. For multi-trillion-parameter models, Tensordyne claims its platform supports a throughput of 1000 tokens per second per user in a single-rack configuration, a task that it says would require nine Nvidia Rubin plus Groq LPX racks [2]. The company projects up to $33 million more annual revenue per rack compared to Nvidia Blackwell systems [2]. However, these figures remain based on simulations, with real systems not expected until the end of the year [1]. Tensordyne is working toward beta deployment and reports over $200 million in forecasted Napier system demand [2]. Whether the company can deliver on these claims will determine if this AI chip startup can genuinely compete with Nvidia in the rapidly evolving AI hardware landscape.

Summarized by Navi