2 Sources
[1]
Uncle Sam's next big supercomputer might use something more exotic than GPUs
Of the world's most powerful supercomputers, nine of the top 10 are powered by GPUs, but that might not be the case for much longer. As chipmakers like Nvidia prioritize AI FLOPS over the ultra-precise floating point calculations used in scientific computing, US National Labs are turning to new chip architectures to get their FP64 fix. Among the candidates is NextSilicon's Maverick-2, a dataflow processor designed explicitly with the 64-bit floating point mathematics that dominate the Department of Energy's most important simulations. Despite its name, the Department of Energy is concerned with far more than the US' power grid. It operates some of the largest publicly known supercomputers in the world, which are responsible for everything from simulating the physics of nuclear weapons at the moment of criticality and bioweapons defense to public health and safety. Since the Titan Supercomputer made its debut in 2012, a growing number of these supercomputers have been powered by GPUs from Nvidia, and more recently AMD. But that's not the case for Sandia National Laboratory's new Spectra supercomputer, which was built in collaboration with Penguin Solutions and NextSilicon. Compared to exascale systems like Frontier or El Capitan, Spectra is tiny. The machine counts 64 nodes and 128 of NextSilicon's "runtime-configurable" accelerators. But scale isn't the point. Spectra is a test bed for NextSilicon's Maverick-2. This week, Sandia gave the chips the thumbs up, announcing that the big iron had met all of its system acceptance requirements, opening the door for the chips to be deployed in larger systems in the future. Despite some similarities to Nvidia's B200, Maverick-2 is a very different beast. Instead of the standard von Neumann compute architecture that underpins most CPUs and GPUs today, NextSilicon's chips employ a reconfigurable dataflow architecture. The processor's two compute dies comprise a grid of arithmetic logic units interconnected in a graph. Each unit is configured at runtime to perform a specific operation, whether it be addition, multiplication, or some other logic operation. But the chip's real trick is overlapping data flow and compute. As soon as data reaches the next unit in the pipeline, it's computed immediately, no waiting for load-store operations to shuffle data around. According to NextSilicon, this dramatically improves the performance and efficiency of the chips in real-world workloads. Dataflow architectures aren't new. Groq, Cerebras, and SambaNova have all built chips based on the concept. However, all of these designs are aimed at AI inference or training. NextSilicon's is one of the few we've seen aimed at HPC. Dataflow is notoriously difficult to program for, which is likely why the chip startups that have built chips around it have largely offered them as a managed or white glove service rather than selling bare metal servers. Rather than trying to port workloads to run on its chips, NextSilicon has built a compiler that it claims allows it to run any existing C, Python, Fortran, or CUDA codebases on its chips. As we understand it, it works by initially running these workloads on the CPU. The compiler then captures the compute graph, maps it to the chips, and then optimizes it to maximize performance. With Spectra, Sandia has now validated the parts across three key workloads: the high-performance conjugate gradient (HPCG) benchmark, the LAMMPS molecular dynamics test suite, and the Sparta Monte Carlo simulation suite. NextSilicon's focus on HPC comes in stark contrast to the next generation of GPUs from Nvidia. The company's Rubin GPUs due out later this year promise gobs of memory bandwidth and up to 50 petaFLOPS of FP4 compute. This makes the chips strong contenders for AI inference and training workloads, which is probably why the DoE is also deploying them in systems like the Doudna supercomputer at Lawrence Berkeley National Laboratory. While FP64 compute remains relevant for many existing scientific workloads, for AI workloads, Nvidia's GPUs are still relevant to US Labs. However, all those AI FLOPS come at the expense of hardware FP64 vector and matrix performance. Rubin tops out at 33 teraFLOPS, making it slower than even Nvidia's nearly four-year-old H100. But that's not to say it's not good for scientific computing. For matrix heavy workloads like High Performance Linpack (HPL), Nvidia is leaning on a somewhat controversial spin on the Ozaki scheme, which uses lower precision data types to emulate FP64 compute. Using this approach, Nvidia claims Rubin can deliver up to 200 teraFLOPS of FP64 matrix performance. We dug deeper into Nvidia's emulated FP64 algorithms earlier this year, but suffice to say it's not perfect. While it has shown promise in certain HPC workloads, in others, particularly vector-heavy ones, like computational fluid dynamics, it offers little if any benefit. Coincidentally, the latter happens to be the same kind of workload that NextSilicon has focused its attention on. We don't yet have system-level benchmarks for NextSilicon's hardware, much less Spectra, but we're told a single Maverick-2 can deliver about 600 gigaFLOPS of FP64 compute HPCG. The startup claims this performance is roughly on par with leading GPUs while consuming half the power. While Nvidia is clearly prioritizing AI compute in its latest generation of GPUs, AMD has taken a different approach. Like Rubin, AMD's new MI455X accelerators are tuned for AI inference and training, but it's only one of several versions of the GPU the House of Zen has baked in TSMC's oven. For the MI430X, AMD swapped out the AI-centric compute dies for some built specifically for HPC. Earlier this month, we learned the chip would deliver up to 200 teraFLOPS of peak FP64 grunt to the DoE's upcoming Discovery and Europe's Alice Recoque supercomputers. Chip startups like NextSilicon still need to prove their chips can scale to larger systems. But, across the Pacific, China has already shown that, at least for scientific computing, it doesn't need GPUs to compete with the West's best supers. China has a history of building boutique silicon specifically to advance its national supercomputing capability. Some systems, like the Sunway TaihuLight supercomputer, used a custom manycore processor like 260 custom RISC processors. Others, like the Tianhe 2A, used a homegrown digital signal processor (DSP) called the Matrix 2000 for its FP64 compute. More recently, we caught wind of a new supercomputer, called the LineShine, that, similar to the TaihuLight machine, reportedly uses 47,000 custom CPUs, which are expected to push the machine to 2 exaFLOPS of FP64 grunt. Of course, because China doesn't participate in the annual Top500 ranking of the fastest publicly known supers anymore, we may never know for sure. China's use of boutique silicon is due in part to US trade restrictions on the sale of high-end accelerators in the region. Even where still legal, these chips have become a supply chain vulnerability for Beijing. In fact, the US government's decision to bar Intel from selling its Xeon Phi processors to China drove the development of the Matrix 2000. In the US, the bigger challenge may be competing with chip designers' shareholders. AI has made Nvidia the most valuable company in the world; HPC by comparison remains an important, albeit niche market. ®
[2]
As chip industry chases AI, U.S. national labs look to newcomers for supercomputers
ALBUQUERQUE, New Mexico, May 18 (Reuters) - In a nondescript building on Kirtland Air Force Base on the high desert of New Mexico, liquid-cooled supercomputers gurgle and hum their way through some of the most complex math problems the U.S. government seeks to solve: simulating how hypersonic nuclear weapons would move through the earth's atmosphere, or what would happen if one nuclear warhead detonated near another. For more than a decade, the chips handling this secretive and demanding work came from mainstream semiconductor firms like Nvidia (NVDA.O), opens new tab or Advanced Micro Devices (AMD.O), opens new tab. But with those companies increasingly designing their chips for artificial intelligence and facing supply shortages, the managers in charge of the systems at Sandia National Laboratories, which operates the machines at Kirtland and is one of three U.S. labs tasked with developing and maintaining the nation's nuclear weapons arsenal, are increasingly unsure how they will find computing power for high-precision scientific work like theirs. "The pressure we're feeling right now is on the computing front and also from the supply chain," said â Steve Monk, the manager of Sandia's high-performance computing team, explaining the challenge of getting chips that meet his needs. "Looking to the future, it's a bit stressful in terms of our ability to deliver to the mission." NEW ENTRANTS INTO CHIP MARKET The lab's predicament shows how the race for better AI chips is having the unintended consequence of opening markets once dominated by the big firms to smaller players such as NextSilicon, an Israeli startup whose chips are being tested by a program at Sandia. It also shows the role that Sandia, which worked with Nvidia extensively as the company rose to prominence in supercomputing and is still collaborating with Nvidia on new memory technology, plays in incubating and shaping new computing technologies. One major concern for officials at Sandia is what is known as double-precision floating point computation, a technical term for being able to compute both very large and very small numbers without losing accuracy to rounding errors. For years, Nvidia and AMD pursued leadership in speeding up that kind of computing, landing supercomputing contracts with universities and government labs. But AI work does not benefit from double-precision computing in the same way as simulating physics problems. While AMD is releasing a version of its chips aimed at â scientific computing, the double-precision performance of Nvidia's forthcoming Rubin chips has declined by some measures, worrying many scientists in the high-performance computing industry, said Ian Cutress, chief analyst at More Than Moore, a chip consulting firm. Daniel Ernst, senior director of supercomputing products at Nvidia, said the company remains committed to scientific computing, aiming to create a balanced chip that can run real-world scientific applications alongside AI work. But the shifting chip market has prompted officials at Sandia to test products from newcomers such as NextSilicon, whose chip uses a completely different computing approach than graphics processing units (GPUs) or central processing units (CPUs) from Nvidia and AMD. NUCLEAR SECURITY WORK On Monday, Sandia, NextSilicon and Penguin Solutions, â the firm that helped weave NextSilicon's chips into a supercomputer, said the systems have passed a key technical milestone using a battery of general supercomputing tests that put the chips in the running for use in government systems. That sets up NextSilicon's chips for a decision this fall on whether to start testing the chips with more demanding computing problems that closely resemble the kind of nuclear security work they would eventually have to handle. The NextSilicon chips â can perform double-precision computing and are also designed to reprogram themselves on the fly to run more efficiently. NextSilicon's chips save electricity by using what is known as a data flow architecture that spends less time and energy shuffling data back and forth to the computing system's memory. Sandia's work with chip firms often helps technology become widespread. Liquid cooling systems for chips were an exotic idea when Sandia started urging Intel, â AMD and Nvidia to work on the technology more than a decade ago, and now they are common. James Laros, a senior scientist at Sandia who oversees a program to test new computing architectures at Sandia, said the work with smaller players like NextSilicon is aimed at ensuring Sandia can always procure the chips it needs, even if major chip firms shift focus. "We have to keep available options to complete our mission, because the mission is not optional," Laros said. Reporting by Stephen Nellis in Albuquerque, New Mexico; editing by Peter Henderson and Nick Zieminski Our Standards: The Thomson Reuters Trust Principles., opens new tab
Share
Copy Link
As the chip industry chases AI, US national labs face a supply crunch for high-precision computing. Sandia National Laboratories is testing NextSilicon's dataflow processors as alternatives to GPUs for supercomputers, marking a shift in how nuclear weapons simulations and critical scientific work gets done. The move signals growing concerns about Nvidia's declining focus on FP64 performance.
The race to build better AI chips is reshaping the supercomputing landscape in ways few anticipated. Nine of the world's top 10 most powerful supercomputers rely on GPUs, but US national labs are now exploring alternatives to GPUs for supercomputers as chipmakers like Nvidia prioritize AI workloads over the ultra-precise calculations needed for scientific computing
1
. This shift has opened the door for new chip architectures from smaller players to challenge the dominance of established semiconductor giants.At Sandia National Laboratories, located on Kirtland Air Force Base in New Mexico, liquid-cooled supercomputers handle some of the most complex and secretive work the US government undertakes: simulating nuclear weapons as they move through the atmosphere and modeling what happens when nuclear warheads detonate near each other
2
. For over a decade, mainstream semiconductor firms supplied the chips for this demanding work, but supply shortages and shifting design priorities are forcing labs to rethink their procurement strategies.Sandia recently validated NextSilicon's Maverick-2, a dataflow processor designed explicitly for the 64-bit floating point mathematics that dominate High-Performance Computing (HPC) applications at the Department of Energy. The Spectra supercomputer, built in collaboration with Penguin Solutions and NextSilicon, met all system acceptance requirements this week, opening the door for deployment in larger systems
1
. While Spectra is modest in scaleâcounting just 64 nodes and 128 of NextSilicon's runtime-configurable acceleratorsâit serves as a crucial test bed for proving the technology works.
Source: Reuters
The Israeli startup's chips passed validation using three key workloads: the high-performance conjugate gradient benchmark, the LAMMPS molecular dynamics test suite, and the Sparta Monte Carlo simulation suite
1
. This technical milestone puts NextSilicon in the running for government systems and sets up a decision this fall on whether to test the chips with more demanding computing problems that closely resemble actual nuclear security work2
.The concern driving this search for new options centers on double-precision floating point computationâFP64âwhich allows computers to handle both very large and very small numbers without losing accuracy to rounding errors. This capability is essential for simulating nuclear weapons and other physics problems
2
. But AI work doesn't benefit from high-precision computing the same way scientific applications do, leading chipmakers to deprioritize it.Nvidia's forthcoming Rubin GPUs illustrate this tension. The chips promise up to 50 petaFLOPS of FP4 compute for AI inference and training, but their FP64 hardware performance tops out at just 33 teraFLOPSâslower than the nearly four-year-old H100
1
. While Nvidia claims Rubin can deliver up to 200 teraFLOPS of FP64 matrix performance using emulated approaches based on the Ozaki scheme, this method has limitations in vector-heavy workloads like computational fluid dynamics."The pressure we're feeling right now is on the computing front and also from the supply chain," said Steve Monk, manager of Sandia's high-performance computing team. "Looking to the future, it's a bit stressful in terms of our ability to deliver to the mission"
2
.Related Stories
Unlike the von Neumann architecture that underpins most CPUs and GPUs, the Maverick-2 employs a reconfigurable dataflow processor design. The chip's two compute dies comprise a grid of arithmetic logic units interconnected in a graph, with each unit configured at runtime to perform specific operations
1
. The real advantage comes from overlapping data flow and computeâas soon as data reaches the next unit in the pipeline, it's computed immediately without waiting for load-store operations to shuffle data around.This approach dramatically improves performance and efficiency in real-world workloads while saving electricity by spending less time and energy moving data back and forth to memory
1
2
. While dataflow architectures aren't newâcompanies like Groq, Cerebras, and SambaNova have built similar designsâNextSilicon is one of the few targeting HPC rather than AI workloads.To address the notorious difficulty of programming dataflow systems, NextSilicon built a compiler that runs existing C, Python, Fortran, or CUDA codebases on its chips by capturing the compute graph and optimizing it for maximum performance
1
.Sandia's work with NextSilicon reflects a broader strategy to ensure mission-critical capabilities remain available even as major chip firms shift focus. James Laros, a senior scientist at Sandia who oversees testing of new computing architectures, emphasized this imperative: "We have to keep available options to complete our mission, because the mission is not optional"
2
.
Source: The Register
The lab has a history of incubating technologies that later become widespread. Liquid cooling systems for chips were exotic when Sandia started urging Intel, AMD, and Nvidia to develop the technology over a decade ago; now they're common
2
. This pattern suggests that today's experimental dataflow processors could influence tomorrow's mainstream scientific computing infrastructure, particularly if the chip industry chases AI at the continued expense of scientific computing needs.Summarized by
Navi
26 Nov 2025â¢Technology

18 Nov 2025â¢Science and Research

13 Jun 2025â¢Technology

1
Science and Research

2
Technology

3
Policy and Regulation
