5 Sources
[1]
New NVIDIA Nemotron 3 Super Delivers 5x Higher Throughput for Agentic AI
A new, open, 120-billion-parameter hybrid mixture-of-experts model optimized for NVIDIA Blackwell addresses the costs of long thinking and context explosion that slow autonomous agent workflows.

Launched today, NVIDIA Nemotron 3 Super is a 120-billion-parameter open model with 12 billion active parameters designed to run complex agentic AI systems at scale. Available now, the model combines advanced reasoning with efficient execution to complete tasks with high accuracy for autonomous agents.

AI-Native Companies: Perplexity offers its users access to Nemotron 3 Super for search and as one of 20 orchestrated models in Computer. Companies offering software development agents, such as CodeRabbit, Factory and Greptile, are integrating the model into their AI agents alongside proprietary models to achieve higher accuracy at lower cost. Life sciences and frontier AI organizations including Edison Scientific and Lila Sciences will use it to power agents for deep literature search, data science and molecular understanding.

Enterprise Software Platforms: Industry leaders such as Amdocs, Palantir, Cadence, Dassault Systèmes and Siemens are deploying and customizing the model to automate workflows in telecom, cybersecurity, semiconductor design and manufacturing.

As companies move beyond chatbots and into multi-agent applications, they encounter two constraints. The first is context explosion. Multi-agent workflows generate up to 15x more tokens than standard chat because each interaction requires resending full histories, including tool outputs and intermediate reasoning. Over long tasks, this volume of context increases costs and can lead to goal drift, where agents lose alignment with the original objective. The second is the thinking tax. Complex agents must reason at every step, but using large models for every subtask makes multi-agent applications too expensive and sluggish for practical use.
Nemotron 3 Super has a 1-million-token context window, allowing agents to retain full workflow state in memory and preventing goal drift. The model has set new standards, claiming the top spot on Artificial Analysis for efficiency and openness with leading accuracy among models of its size. It also powers the NVIDIA AI-Q research agent to the No. 1 position on the DeepResearch Bench and DeepResearch Bench II leaderboards, benchmarks that measure an AI system's ability to conduct thorough, multistep research across large document sets while maintaining reasoning coherence.

Hybrid Architecture

Nemotron 3 Super uses a hybrid mixture-of-experts (MoE) architecture that combines the following innovations to deliver up to 5x higher throughput and up to 2x higher accuracy than the previous Nemotron Super model:

* Hybrid Architecture: Mamba layers deliver 4x higher memory and compute efficiency, while transformer layers drive advanced reasoning.
* MoE: Only 12 billion of its 120 billion parameters are active at inference.
* Latent MoE: A new technique that improves accuracy by activating four expert specialists for the cost of one when generating the next token at inference.
* Multi-Token Prediction: Predicts multiple future tokens simultaneously, resulting in 3x faster inference.

On the NVIDIA Blackwell platform, the model runs in NVFP4 precision, which cuts memory requirements and pushes inference up to 4x faster than FP8 on NVIDIA Hopper, with no loss in accuracy.

Open Weights, Data and Recipes

NVIDIA is releasing Nemotron 3 Super with open weights under a permissive license. Developers can deploy and customize it on workstations, in data centers or in the cloud. The model was trained on synthetic data generated using frontier reasoning models. NVIDIA is publishing the complete methodology, including over 10 trillion tokens of pre- and post-training datasets, 15 training environments for reinforcement learning, and evaluation recipes.
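The Latent MoE idea described above, activating several experts for roughly the cost of one by routing in a compressed space, can be illustrated with a toy sketch. Every dimension, weight matrix, and the `latent_moe` function below are invented for illustration and bear no relation to Nemotron's actual implementation; this is only a minimal numpy sketch of compressed-space top-k routing.

```python
import numpy as np

# Toy sketch of latent-space top-k expert routing. All sizes are made up.
rng = np.random.default_rng(0)
d_model, d_latent = 64, 16          # 4x compression into the latent space
n_experts, top_k = 8, 4

W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
router = rng.standard_normal((d_latent, n_experts))
experts = rng.standard_normal((n_experts, d_latent, d_latent)) / np.sqrt(d_latent)

def latent_moe(x):
    """Compress a token, route it to top-k experts, then expand back."""
    z = x @ W_down                              # d_model -> d_latent
    logits = z @ router
    top = np.argsort(logits)[-top_k:]           # k best-scoring experts
    gates = np.exp(logits[top])
    gates /= gates.sum()
    out = sum(g * (z @ experts[i]) for g, i in zip(gates, top))
    return out @ W_up                           # d_latent -> d_model

y = latent_moe(rng.standard_normal(d_model))

# Per-token expert cost scales with d_latent**2, not d_model**2, so four
# latent experts cost about the same as one full-width expert would here.
full_width_cost = d_model * d_model
latent_cost = top_k * d_latent * d_latent
print(y.shape, latent_cost <= full_width_cost)  # (64,) True
```

With the 4x compression chosen here, four latent experts together cost fewer multiply-adds than a single expert operating at the full hidden width, which is the trade the Latent MoE description points at.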
Researchers can further use the NVIDIA NeMo platform to fine-tune the model or build their own.

Use in Agentic Systems

Nemotron 3 Super is designed to handle complex subtasks inside a multi-agent system. A software development agent can load an entire codebase into context at once, enabling end-to-end code generation and debugging without document segmentation. In financial analysis, it can load thousands of pages of reports into memory, eliminating the need to re-reason across long conversations and improving efficiency. High-accuracy tool calling helps autonomous agents reliably navigate massive function libraries and prevent execution errors in high-stakes environments, such as autonomous security orchestration in cybersecurity.

Availability

NVIDIA Nemotron 3 Super, part of the Nemotron 3 family, can be accessed at build.nvidia.com, Perplexity, OpenRouter and Hugging Face. Dell Technologies is bringing the model to the Dell Enterprise Hub on Hugging Face, optimized for on-premises deployment on the Dell AI Factory, advancing multi-agent AI workflows. HPE is also bringing NVIDIA Nemotron to its agents hub to help ensure scalable enterprise adoption of agentic AI. Enterprises and developers can deploy the model through several partners:

* Cloud Service Providers: Google Cloud's Vertex AI and Oracle Cloud Infrastructure, and coming soon to Amazon Web Services through Amazon Bedrock as well as Microsoft Azure.
* NVIDIA Cloud Partners: CoreWeave, Crusoe, Nebius and Together AI.
* Inference Service Providers: Baseten, Cloudflare, DeepInfra, Fireworks AI, Inference.net, Lightning AI, Modal and FriendliAI.
* Data Platforms and Services: Distyl, Dataiku, DataRobot, Deloitte, EY and Tata Consultancy Services.

The model is packaged as an NVIDIA NIM microservice, allowing deployment from on-premises systems to the cloud.
Stay up to date on agentic AI, NVIDIA Nemotron and more by subscribing to NVIDIA AI news, joining the community, and following NVIDIA AI on LinkedIn, Instagram, X and Facebook.
[2]
Nvidia's new open weights Nemotron 3 Super combines three different architectures to beat gpt-oss and Qwen in throughput
Multi-agent systems, designed to handle long-horizon tasks like software engineering or cybersecurity triaging, can generate up to 15 times the token volume of standard chats -- threatening their cost-effectiveness in handling enterprise tasks. But today, Nvidia sought to help solve this problem with the release of Nemotron 3 Super, a 120-billion-parameter hybrid model, with weights posted on Hugging Face. By merging disparate architectural philosophies -- state-space models, transformers, and a novel "Latent" mixture-of-experts design -- Nvidia is attempting to provide the specialized depth required for agentic workflows without the bloat typical of dense reasoning models, all available for commercial usage under mostly open weights.

At the core of Nemotron 3 Super is a sophisticated architectural triad that balances memory efficiency with precision reasoning. The model utilizes a Hybrid Mamba-Transformer backbone, which interleaves Mamba-2 layers with strategic Transformer attention layers.

To understand the implications for enterprise production, consider the "needle in a haystack" problem. Mamba-2 layers act like a "fast-travel" highway system, handling the vast majority of sequence processing with linear-time complexity. This allows the model to maintain a massive 1-million-token context window without the memory footprint of the KV cache exploding. However, pure state-space models often struggle with associative recall. To fix this, Nvidia strategically inserts Transformer attention layers as "global anchors," ensuring the model can precisely retrieve specific facts buried deep within a codebase or a stack of financial reports.

Beyond the backbone, the model introduces Latent Mixture-of-Experts (LatentMoE). Traditional Mixture-of-Experts (MoE) designs route tokens to experts in their full hidden dimension, which creates a computational bottleneck as models scale.
LatentMoE solves this by projecting tokens into a compressed space before routing them to specialists. This "expert compression" allows the model to consult four times as many specialists for the exact same computational cost. This granularity is vital for agents that must switch between Python syntax, SQL logic, and conversational reasoning within a single turn.

Further accelerating the model is Multi-Token Prediction (MTP). While standard models predict a single next token, MTP predicts several future tokens simultaneously. This serves as a "built-in draft model," enabling native speculative decoding that can deliver up to 3x wall-clock speedups for structured generation tasks like code or tool calls.

For enterprises, the most significant technical leap in Nemotron 3 Super is its optimization for the Nvidia Blackwell GPU platform. By pre-training natively in NVFP4 (4-bit floating point), Nvidia has achieved a breakthrough in production efficiency. On Blackwell, the model delivers 4x faster inference than 8-bit models running on the previous Hopper architecture, with no loss in accuracy.

In practical performance, Nemotron 3 Super is a specialized tool for agentic reasoning. It currently holds the No. 1 position on the DeepResearch Bench, a benchmark measuring an AI's ability to conduct thorough, multi-step research across large document sets. It also demonstrates significant throughput advantages, achieving up to 2.2x higher throughput than gpt-oss-120B and 7.5x higher than Qwen3.5-122B in high-volume settings.

The release of Nemotron 3 Super under the Nvidia Open Model License Agreement (updated October 2025) provides a permissive framework for enterprise adoption, though it carries distinct "safeguard" clauses that differentiate it from pure open-source licenses like MIT or Apache 2.0.
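The speculative-decoding loop that multi-token prediction enables can be sketched as follows. The draft and target functions here are trivial stand-ins invented for this example (both just increment a token id) so that the loop is self-contained and deterministic; a real system would also verify all drafted tokens in one batched forward pass rather than one call per token.

```python
# Toy speculative-decoding loop of the kind multi-token prediction enables.
# draft_propose stands in for the cheap MTP heads and target_next for the
# expensive full model; both are invented for this sketch.

def draft_propose(prefix, k):
    """Cheaply guess the next k tokens (stand-in for MTP heads)."""
    return [(prefix[-1] + i + 1) % 100 for i in range(k)]

def target_next(prefix):
    """One expensive full-model step (stand-in)."""
    return (prefix[-1] + 1) % 100

def speculative_step(prefix, k=4):
    """Accept the longest drafted prefix the target model agrees with."""
    accepted = []
    for tok in draft_propose(prefix, k):
        if target_next(prefix + accepted) == tok:
            accepted.append(tok)             # draft verified, keep it
        else:
            accepted.append(target_next(prefix + accepted))  # correction
            break
    return accepted

# Because the toy draft always agrees with the toy target, every step
# emits k tokens instead of one, which is where the speedup comes from.
tokens = [0]
while len(tokens) < 12:
    tokens.extend(speculative_step(tokens))
print(tokens[:12])
```

When the draft disagrees, the loop falls back to the target model's token and discards the rest of the draft, so output quality matches plain decoding; only the number of expensive steps changes.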
The license includes two critical termination triggers that production teams must monitor. This structure allows Nvidia to foster a commercial ecosystem while protecting itself from "IP trolling" and ensuring that the model isn't stripped of its safety features for malicious use.

The release has generated significant buzz within the developer community. Chris Alexiuk, a senior product research engineer at Nvidia, heralded the launch on X under his handle @llm_wizard as a "SUPER DAY," emphasizing the model's speed and transparency. "Model is: FAST. Model is: SMART. Model is: THE MOST OPEN MODEL WE'VE DONE YET," Chris posted, highlighting the release of not just weights but 10 trillion tokens of training data and recipes. Industry adoption reflects this enthusiasm.

As Kari Briski, Nvidia VP of AI Software, noted: "As companies move beyond chatbots and into multi-agent applications, they encounter... context explosion." Nemotron 3 Super is Nvidia's answer to that explosion: a model that provides the "brainpower" of a 120B-parameter system with the operational efficiency of a much smaller specialist. For the enterprise, the message is clear: the "thinking tax" is finally coming down.
[3]
Nvidia's Nemotron 3 Super model for agentic systems launches with five times higher throughput - SiliconANGLE
With so much talk about its upcoming Vera Rubin graphics processing units, it's easy to forget that Nvidia Corp. doesn't just supply the hardware for artificial intelligence. It also develops its own series of AI models, and today it announced the availability of its most capable model so far.

The company said Nemotron 3 Super is aimed at running complex agentic AI systems at large scale, combining advanced reasoning skills with rapid processing speeds to efficiently perform tasks that require extreme accuracy. Nemotron 3 Super is a 120 billion-parameter open model based on a hybrid mixture-of-experts architecture. It combines three innovations to achieve up to five times higher throughput and twice the accuracy of the previous-generation Nemotron Super model, Nvidia said.

According to Nvidia, Nemotron 3 Super is designed to tackle two major constraints facing agentic AI systems that aim to automate complex tasks on behalf of their users. The first is an explosion of context. Nvidia said that multi-agent workflows typically generate up to 15 times more tokens than standard chat interactions, because each time a user interacts with one, the model needs to resend context including tool outputs and intermediate reasoning. The second constraint is known as the "thinking tax." Complex agents must reason at each step of a task they complete, which makes it impractical to use much larger models: the more parameters there are, the more expensive processing becomes, and large models are also slower than smaller ones.

To get around these problems, Nemotron 3 Super has a one-million-token context window that allows it to retain full workflow state in memory and prevent "goal drift," Nvidia said.
Moreover, only 12 billion of its 120 billion parameters are active during inference, the process of running a trained model to generate predictions or conclusions on new, unseen data. Nvidia said Nemotron 3 Super runs in NVFP4 precision on its Blackwell GPUs, which reduces memory requirements and speeds up inference by up to four times what can be achieved on its previous-generation Hopper platform.

Nemotron 3 Super is widely accessible and can be downloaded from build.nvidia.com, OpenRouter and Hugging Face. In addition, AI search company Perplexity Inc. is making the model available in its search engine and in its "Computer" AI agent system. Generative AI coding applications such as CodeRabbit, Factory and Greptile are also adding the model to their lineups, while the life sciences organizations Edison Scientific and Lila Sciences will use it to power agents for data science, deep literature research and molecular understanding. Companies including Amdocs, Palantir Technologies Inc., Cadence Design Systems Inc. and Dassault Systèmes SA are also using Nemotron 3 Super to automate workflows in telecommunications, cybersecurity, semiconductor design and manufacturing. Finally, Dell Technologies Inc. and Hewlett Packard Enterprise Co. will also offer access to the model through their respective agent hubs, Nvidia said.

The launch of Nemotron 3 Super comes ahead of Nvidia's annual GTC conference, which is set to kick off next week on March 16, where the company is expected to unveil its next-generation GPU platforms.
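A rough sense of what 4-bit weight formats trade away can be had from a toy block-quantization sketch. One caveat up front: NVFP4 itself is a hardware-supported, block-scaled floating-point format, while the scheme below uses scaled 4-bit integers purely to illustrate the memory/accuracy trade-off; none of it reflects Nvidia's actual encoding.

```python
import numpy as np

# Toy 4-bit block quantization (illustrative only; not the NVFP4 format).

def quantize_4bit(block):
    """Map a block of weights onto signed 4-bit integers in [-7, 7]."""
    scale = float(np.abs(block).max()) / 7 or 1.0
    q = np.clip(np.round(block / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.standard_normal(16).astype(np.float32)  # one 16-value block
q, s = quantize_4bit(w)
err = np.abs(w - dequantize(q, s)).max()

# Round-trip error is bounded by half the scale step, while storage drops
# from 16 bits to 4 bits per weight (plus one shared scale per block).
print(err <= s / 2 + 1e-6)  # True
```

Sharing one scale per small block, rather than per tensor, is what keeps the worst-case rounding error tied to each block's own magnitude; block-scaled formats like NVFP4 push the same idea into hardware.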
[4]
Nvidia launches 120B parameter Nemotron 3 Super open model
Nvidia launched Nemotron 3 Super, a 120-billion-parameter open-weight model designed for large-scale agentic AI systems. The company announced the release on Wednesday, positioning the model for speed and efficiency. The model targets enterprise deployment for complex tasks requiring long-context processing. It is available immediately on several major platforms, expanding access for developers and researchers.

Nemotron 3 Super is available on build.nvidia.com, Perplexity, OpenRouter, and Hugging Face. Enterprises can access it via Google Cloud's Vertex AI, Oracle Cloud Infrastructure, and soon on Amazon Bedrock and Microsoft Azure. The model is also accessible on CoreWeave, Crusoe, Nebius, and Together AI.

The model uses a hybrid latent mixture-of-experts and Mamba-Transformer architecture. This structure allows it to use four times as many expert specialists during inference as previous models at the same cost. Nvidia trained the model on synthetic data from other frontier reasoning models. The company is publishing over 10 trillion tokens of training data and 15 training environments with this release.

According to Artificial Analysis benchmarks, the model scores 36 for overall intelligence. This places it above gpt-oss-120B at 33 points but behind Gemini 3.1 Pro and GPT-5.4, which both score 57. The model achieves a speed of 478 output tokens per second, making it the fastest model available, and Nvidia states it achieves 7.5x higher inference throughput than Qwen3.5-122B.

Nvidia has not announced a release date for Nemotron 3 Ultra, the family's largest, 500-billion-parameter model. The company teased the Ultra variant in its original announcement last year.
[5]
NVIDIA Unveils Nemotron 3 Super as an Open Agentic AI Model, and It Could Be the Perfect Choice for OpenClaw
NVIDIA's Nemotron class of open-source LLMs just got significantly enhanced with the latest release, Nemotron 3 Super, which targets agentic AI workloads with its extensive context window.

For those unaware, when we talk about the leading contributors to the world of open-source AI models, some might think of Chinese AI labs like Kimi or Qwen, but in reality, NVIDIA's Nemotron suite leads the way. With AI distributed across a "five-layer" cake, NVIDIA has not only dominated infrastructure and chips but is also one of the few in the West to have heavily invested in open-source models. With that, NVIDIA has now unveiled Nemotron 3 Super, with the main idea being to run agentic AI applications at scale, making it ideal for agents like OpenClaw.

One of the standout aspects of Nemotron 3 Super is NVIDIA's hybrid Mamba-MoE architecture. Compared to traditional MoE models, Mamba is a really impressive implementation. Essentially, NVIDIA has changed how the LLM interprets the data flow: the Mamba layers rely on a state space model (SSM) to read data linearly, preventing the context from ballooning with irrelevant information.

"Mamba-MoE allows Nemotron 3 Super to maintain an optimal context window for user workloads, yielding the best agentic responses." - NVIDIA

The Mamba layers deliver 4x higher memory efficiency, while transformer layers drive advanced reasoning, making Nemotron 3 Super ideal for inference workloads. Another impressive feature is the 1-million-token context window, four times the size of the one in Kimi 2.5. There's a common rule of thumb within agentic systems: the bigger the window, the better the response. From this aspect alone, Nemotron 3 Super dominates all other open-source LLMs and even comes close to the likes of Opus 4.5, despite being limited to just 120 billion parameters.
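The memory argument behind the Mamba layers can be made concrete with back-of-the-envelope arithmetic. The layer count, hidden size, and SSM state size below are illustrative placeholders, not Nemotron's actual configuration; the point is only the scaling behavior: attention KV-cache memory grows with sequence length, while a recurrent SSM state does not.

```python
# Back-of-the-envelope memory arithmetic. Layer count, hidden size, and
# SSM state size are illustrative placeholders, not Nemotron's real config.
BYTES = 2                          # bf16 per element
d_model, n_layers, seq_len = 8192, 60, 1_000_000

# A pure-attention stack caches K and V for every token, at every layer.
kv_cache_bytes = 2 * d_model * n_layers * seq_len * BYTES

# A Mamba-style layer carries a fixed-size recurrent state instead,
# independent of sequence length.
state_per_layer = 16 * d_model     # placeholder state size
mamba_state_bytes = state_per_layer * n_layers * BYTES

print(f"KV cache at 1M tokens: {kv_cache_bytes / 1e9:,.0f} GB")
print(f"SSM state:             {mamba_state_bytes / 1e6:,.1f} MB")
```

Even with these made-up numbers, the gap is terabytes versus megabytes, which is why a hybrid that keeps only a few attention layers can hold a 1-million-token window at all.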
Speaking of OpenClaw, NVIDIA tested Nemotron 3 Super on PinchBench, a suite used to evaluate agent workloads, and the model scored 85.6% across the full test suite, surpassing Opus 4.5, Kimi 2.5, and GPT-OSS 120b. For consumers running extensive workloads through OpenClaw, Nemotron 3 Super opens up an entirely new class of performance, with compute requirements that can be met by a single GPU. Nemotron 3 Super is a glimpse of how capable agentic AI systems could become, and, interestingly, LLMs are now overcoming compute limitations as well, which is why the future of model deployment at the edge is brighter than ever.
NVIDIA unveiled Nemotron 3 Super, a 120-billion-parameter open weights AI model designed to tackle the challenges of multi-agent systems. The hybrid mixture-of-experts architecture delivers 5x higher throughput while addressing context explosion and the thinking tax that plague autonomous agent workflows. Available now on platforms like Hugging Face and Perplexity, the model is already being deployed by industry leaders including Palantir, Cadence, and Dassault Systèmes.
NVIDIA has launched Nemotron 3 Super, a 120-billion-parameter model built specifically for agentic AI applications that demand both speed and precision [1]. The model addresses two critical bottlenecks facing multi-agent systems: context explosion and what NVIDIA calls the "thinking tax." According to the company, multi-agent workflows generate up to 15 times more tokens than standard chat interactions because each step requires resending full histories, tool outputs, and intermediate reasoning [2]. This volume of context increases costs and can lead to goal drift, where agents lose alignment with their original objective [1].
Source: Wccftech
The open weights AI model is available immediately on build.nvidia.com, Perplexity, OpenRouter, and Hugging Face [3]. Enterprise deployment options include Google Cloud's Vertex AI, Oracle Cloud Infrastructure, and soon Amazon Bedrock and Microsoft Azure [4]. Companies like Perplexity are offering users access to Nemotron 3 Super for search and as one of 20 orchestrated models in Computer, while software development platforms including CodeRabbit, Factory, and Greptile are integrating it into their AI agents [1].

At the core of Nemotron 3 Super lies a sophisticated hybrid mixture-of-experts architecture that merges three distinct innovations [2]. The model utilizes a Hybrid Mamba-Transformer backbone, interleaving Mamba-2 layers with strategic Transformer layers. Mamba layers deliver 4x higher memory and compute efficiency, acting like a fast-travel highway system that handles sequence processing with linear-time complexity [1]. NVIDIA strategically inserts Transformer layers as "global anchors," ensuring the model can precisely retrieve specific facts buried deep within codebases or financial reports [2].

The model introduces Latent MoE, a technique that improves accuracy by activating four expert specialists for the cost of one during inference [1]. Traditional mixture-of-experts designs route tokens to experts in their full hidden dimension, creating computational bottlenecks. Latent MoE solves this by projecting tokens into a compressed space before routing, allowing the model to consult four times as many specialists at the same computational cost [2]. Only 12 billion of its 120 billion parameters are active at inference, dramatically reducing computational overhead [3].

Nemotron 3 Super features a 1-million-token context window that allows agents to retain full workflow state in memory [1]. This extensive window is four times larger than Kimi 2.5's context capacity and follows a common principle in agentic systems: the bigger the window, the better the response [5]. The Mamba architecture relies on a State Space Model (SSM) to read data linearly, preventing large context windows from accumulating irrelevant information while maintaining optimal context for user workloads [5].

For practical applications, a software development agent can load an entire codebase into context at once, enabling end-to-end code generation and debugging without document segmentation [1]. In financial analysis, the model can load thousands of pages of reports into memory, eliminating the need to re-reason across long conversations and improving efficiency. The model also demonstrates high-accuracy tool calling that ensures autonomous agents reliably navigate massive function libraries to prevent execution errors in high-stakes environments like cybersecurity [1].
.
Source: NVIDIA
The most significant technical advancement in Nemotron 3 Super is its optimization for the NVIDIA Blackwell GPU platform [2]. Running in NVFP4 precision, the model cuts memory requirements and pushes inference up to 4x faster than FP8 on NVIDIA Hopper, with no loss in accuracy [1]. The model achieves 478 output tokens per second, making it the fastest model available and delivering 7.5x higher inference throughput than Qwen3.5-122B [4].
Source: VentureBeat
Multi-Token Prediction further accelerates performance by predicting several future tokens simultaneously, serving as a built-in draft model that enables native speculative decoding [2]. This approach delivers up to 3x wall-clock speedups for structured generation tasks like code or tool calls [1]. The model achieves up to 2.2x higher throughput than gpt-oss-120B in high-volume settings [2].

Nemotron 3 Super has claimed the top position on Artificial Analysis for efficiency and openness with leading accuracy among models of the same size [1]. The model powers the NVIDIA AI-Q research agent to the No. 1 position on the DeepResearch Bench and DeepResearch Bench II leaderboards, which measure an AI system's ability to conduct thorough, multistep research across large document sets while maintaining reasoning coherence [1]. When tested on PinchBench, a suite used to evaluate agent workloads, the model scored 85.6% across the full test suite, surpassing Opus 4.5, Kimi 2.5, and GPT-OSS 120b [5].

Industry leaders including Amdocs, Palantir, Cadence, Dassault Systèmes, and Siemens are deploying and customizing the model to automate workflows in telecommunications, cybersecurity, semiconductor design, and manufacturing [1]. Life sciences organizations like Edison Scientific and Lila Sciences will power their agents for deep literature search, data science, and molecular understanding [1]. Dell Technologies is bringing the model to the Dell Enterprise Hub on Hugging Face, optimized for on-premise deployment on the Dell AI Factory [1].

NVIDIA is releasing Nemotron 3 Super with open weights under a permissive license, though it includes distinct safeguard clauses that differentiate it from pure open-source licenses like MIT or Apache 2.0 [2]. The company trained the model on synthetic data generated using frontier reasoning models and is publishing over 10 trillion tokens of pre- and post-training datasets, 15 training environments for reinforcement learning, and evaluation recipes [1][4]. Researchers can use the NVIDIA NeMo platform to fine-tune the model or build their own [1]. The release positions NVIDIA as a leading contributor to open-source AI models in the West, competing with Chinese AI labs while dominating not just infrastructure but also the model layer [5].

Summarized by Navi