NVIDIA Launches Nemotron 3 Super with 5x Higher Throughput for Agentic AI Systems

Reviewed by Nidhi Govil


NVIDIA unveiled Nemotron 3 Super, a 120-billion-parameter open weights AI model designed to tackle the challenges of multi-agent systems. The hybrid mixture-of-experts architecture delivers 5x higher throughput while addressing context explosion and the thinking tax that plague autonomous agent workflows. Available now on platforms like Hugging Face and Perplexity, the model is already being deployed by industry leaders including Palantir, Cadence, and Dassault Systèmes.

NVIDIA Releases Open Weights AI Model for Complex Agent Workflows

NVIDIA has launched Nemotron 3 Super, a 120-billion-parameter model built specifically for agentic AI applications that demand both speed and precision [1]. The model addresses two critical bottlenecks facing multi-agent systems: context explosion and what NVIDIA calls the "thinking tax." According to the company, multi-agent workflows generate up to 15 times more tokens than standard chat interactions because each step requires resending full histories, tool outputs, and intermediate reasoning [2]. This volume of context increases costs and can lead to goal drift, where agents lose alignment with their original objective [1].
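The context-explosion arithmetic above can be sketched in a few lines. This is an illustrative model of full-history resending, not NVIDIA's measurement methodology; the per-step token count and step count are assumptions chosen only to show how the multiplier arises.

```python
def total_tokens_resent(step_tokens, num_steps):
    """Total tokens a model processes when every agent step resends the
    full history so far. Because step k reprocesses everything from steps
    1..k, the total grows quadratically in the number of steps."""
    total = 0
    history = 0
    for _ in range(num_steps):
        history += step_tokens   # history grows by one step's worth of tokens
        total += history         # the entire history is sent again
    return total

# A hypothetical 10-step workflow at 2,000 tokens per step processes
# 110,000 tokens, versus 20,000 if each step saw only its own input.
print(total_tokens_resent(2000, 10))                  # 110000
print(total_tokens_resent(2000, 10) / (2000 * 10))    # 5.5x blow-up
```

Longer workflows inflate the multiplier further, which is why NVIDIA's "up to 15x" figure is plausible for deep agent chains.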

Source: Wccftech

The open weights AI model is available immediately on build.nvidia.com, Perplexity, OpenRouter, and Hugging Face [3]. Enterprise deployment options include Google Cloud's Vertex AI, Oracle Cloud Infrastructure, and soon Amazon Bedrock and Microsoft Azure [4]. Perplexity is offering users access to Nemotron 3 Super for search and as one of 20 orchestrated models in Computer, while software development platforms including CodeRabbit, Factory, and Greptile are integrating it into their AI agents [1].

Hybrid Mixture-of-Experts Architecture Drives Performance Gains

At the core of Nemotron 3 Super lies a hybrid mixture-of-experts architecture that merges three distinct innovations [2]. The model uses a hybrid Mamba-Transformer backbone, interleaving Mamba-2 layers with strategically placed Transformer layers. Mamba layers deliver 4x higher memory and compute efficiency, acting like a fast-travel highway system that handles sequence processing with linear-time complexity [1]. NVIDIA inserts the Transformer layers as "global anchors," ensuring the model can precisely retrieve specific facts buried deep within codebases or financial reports [2].
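The interleaving described above can be pictured as a layer-layout sketch. The 1-in-4 attention ratio and layer count here are assumptions for illustration; NVIDIA has not published Nemotron 3 Super's exact pattern.

```python
def build_layer_stack(num_layers, attention_every=4):
    """Sketch of a hybrid Mamba-Transformer backbone: mostly Mamba-2
    layers, with a full-attention Transformer layer inserted periodically
    as a 'global anchor' for precise long-range retrieval. The ratio is
    illustrative, not the model's published configuration."""
    layers = []
    for i in range(num_layers):
        if (i + 1) % attention_every == 0:
            layers.append("attention")   # global anchor: exact fact retrieval
        else:
            layers.append("mamba2")      # linear-time sequence mixing
    return layers

print(build_layer_stack(12))
# ['mamba2', 'mamba2', 'mamba2', 'attention', 'mamba2', ...]
```

The design intuition: the cheap Mamba-2 layers do the bulk of sequence processing, while the occasional attention layers preserve the ability to look up any token exactly.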

The model introduces Latent MoE, a technique that improves accuracy by activating four expert specialists for the cost of one during inference [1]. Traditional mixture-of-experts designs route tokens to experts in their full hidden dimension, creating computational bottlenecks. Latent MoE solves this by projecting tokens into a compressed space before routing, allowing the model to consult four times as many specialists at the same computational cost [2]. Only 12 billion of its 120 billion parameters are active at inference, dramatically reducing computational overhead [3].
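The "four experts for the cost of one" claim follows from simple FLOP arithmetic once experts operate at a compressed width. The widths below are assumptions picked to make the ratio exact; NVIDIA has not disclosed the model's actual hidden or latent dimensions.

```python
def expert_flops(num_experts, width):
    """FLOPs-per-token proxy for an expert layer: each consulted expert
    applies a square (width x width) transform at the routing width."""
    return num_experts * width * width

H, L = 4096, 2048  # full hidden width vs. compressed latent width
                   # (illustrative sizes only)

dense_route  = expert_flops(1, H)  # classic MoE: one expert, full width
latent_route = expert_flops(4, L)  # Latent MoE: four experts, half width
print(latent_route / dense_route)  # 1.0 -> four specialists, cost of one
```

Halving the width quarters each expert's cost, so four compressed-space experts match one full-width expert exactly in this toy accounting.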

1-Million-Token Context Window Prevents Agent Misalignment

Nemotron 3 Super features a 1-million-token context window that allows agents to retain full workflow state in memory [1]. This window is four times larger than Kimi 2.5's context capacity, following a common principle in agentic systems: the bigger the window, the more complete the context an agent can draw on [5]. The Mamba architecture relies on state space models (SSMs) to process data in linear time, preventing large context windows from accumulating irrelevant information while maintaining optimal context for user workloads [5].
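The reason SSM layers scale to such windows can be shown with a minimal recurrence. Real Mamba-2 layers use learned, input-dependent state matrices; the scalar coefficients below are purely illustrative.

```python
def ssm_scan(inputs, a=0.9, b=0.1):
    """Minimal 1-D state-space recurrence: h_t = a*h_{t-1} + b*x_t.
    The state h is fixed-size no matter how long the input is, which is
    why SSM/Mamba layers scan a sequence in linear time with constant
    memory, unlike attention, whose cost grows quadratically with
    sequence length."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x        # one O(1) update per token
        outputs.append(h)
    return outputs

ys = ssm_scan([1.0, 0.0, 0.0, 0.0])
print(ys)  # [0.1, 0.09, 0.081, 0.0729...] -- geometric decay of the state
```

Because the per-token state stays constant-size, a 1-million-token window costs a million cheap updates rather than a trillion pairwise attention comparisons.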

For practical applications, a software development agent can load an entire codebase into context at once, enabling end-to-end code generation and debugging without document segmentation [1]. In financial analysis, the model can load thousands of pages of reports into memory, eliminating the need to re-reason across long conversations and improving efficiency. The model also demonstrates high-accuracy tool calling, ensuring autonomous agents reliably navigate massive function libraries and avoid execution errors in high-stakes environments like cybersecurity [1].
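What "reliable tool calling" guards against can be made concrete with a validation-before-execution sketch. The tool names, schemas, and registry below are hypothetical and are not part of any NVIDIA or Nemotron API; the point is only the class of failure a dispatcher must catch.

```python
import json

# Hypothetical tool registry: names and required arguments are invented
# for illustration.
TOOLS = {
    "scan_host": {"required": ["ip"], "fn": lambda a: f"scanned {a['ip']}"},
    "read_file": {"required": ["path"], "fn": lambda a: f"read {a['path']}"},
}

def dispatch(tool_call_json):
    """Validate a model-emitted tool call before executing it. The checks
    catch the two common failure modes of weak tool calling -- an unknown
    tool name, or a known tool invoked with missing arguments -- before
    anything runs in a high-stakes pipeline."""
    call = json.loads(tool_call_json)
    tool = TOOLS.get(call.get("name"))
    if tool is None:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    missing = [k for k in tool["required"] if k not in call.get("arguments", {})]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return tool["fn"](call["arguments"])

print(dispatch('{"name": "scan_host", "arguments": {"ip": "10.0.0.5"}}'))
# prints "scanned 10.0.0.5"
```

A model that rarely trips these checks is what lets agents navigate large function libraries autonomously.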

Source: NVIDIA

NVIDIA Blackwell Optimization Delivers Speed Breakthrough

The most significant technical advancement in Nemotron 3 Super is its optimization for the NVIDIA Blackwell GPU platform [2]. Running in NVFP4 precision, the model cuts memory requirements and pushes inference up to 4x faster than FP8 on NVIDIA Hopper, with no loss in accuracy [1]. The model achieves 478 output tokens per second, which NVIDIA says makes it the fastest model available, delivering 7.5x higher inference throughput than Qwen3.5-122B [4].
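The memory side of the NVFP4 claim is straightforward arithmetic. This back-of-the-envelope sketch ignores NVFP4's small per-block scale factors and activation memory, and the speedup additionally depends on Blackwell's FP4 tensor cores, not memory alone.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight-storage footprint: parameters x bits, in GB."""
    return num_params * bits_per_param / 8 / 1e9

params = 120e9  # Nemotron 3 Super's total parameter count

fp8_gb = weight_memory_gb(params, 8)   # 8-bit weights
fp4_gb = weight_memory_gb(params, 4)   # 4-bit NVFP4 weights
print(fp8_gb, fp4_gb)  # 120.0 60.0 -> half the weight memory
```

Note that only 12B parameters are *active* per token, but all 120B must still be resident, so the precision cut matters for fitting the model on fewer GPUs.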

Source: VentureBeat

Multi-Token Prediction further accelerates performance by predicting several future tokens simultaneously, serving as a built-in draft model that enables native speculative decoding [2]. This approach delivers up to 3x wall-clock speedups for structured generation tasks like code or tool calls [1]. The model achieves up to 2.2x higher throughput than gpt-oss-120B in high-volume settings [2].
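The accept/reject logic behind speculative decoding can be sketched sequentially. In real systems the target model verifies all draft tokens in one parallel pass rather than one call per token; this simplified loop only shows why the output matches plain autoregressive decoding.

```python
def speculative_step(draft_tokens, target_next):
    """One speculative-decoding step. `draft_tokens` come from a cheap
    draft (here, conceptually, the model's own multi-token-prediction
    heads); `target_next(prefix)` is the token the full model would emit
    after `prefix`. Drafts are kept while the full model agrees; at the
    first disagreement, the model's own token is kept instead and the
    step ends, so outputs are identical to ordinary decoding."""
    accepted = []
    for tok in draft_tokens:
        model_tok = target_next(accepted)
        accepted.append(model_tok)
        if model_tok != tok:
            break
    return accepted

# Toy target model that always continues the sequence 1, 2, 3, 4, ...
target = lambda prefix: len(prefix) + 1

print(speculative_step([1, 2, 3, 9], target))  # [1, 2, 3, 4]
# Three drafts accepted plus one corrected token from a single round.
```

The speedup comes from emitting several tokens per verification round whenever the draft is accurate, which is why structured output like code benefits most.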

Benchmark Performance and Enterprise Adoption

Nemotron 3 Super has claimed the top position on the Artificial Analysis index for efficiency and openness, with leading accuracy among models of its size [1]. The model powers the NVIDIA AI-Q research agent to the No. 1 position on the DeepResearch Bench and DeepResearch Bench II leaderboards, which measure an AI system's ability to conduct thorough, multistep research across large document sets while maintaining reasoning coherence [1]. On PinchBench, a suite used to evaluate agent workloads, the model scored 85.6% across the full test suite, surpassing Opus 4.5, Kimi 2.5, and gpt-oss-120B [5].

Industry leaders including Amdocs, Palantir, Cadence, Dassault Systèmes, and Siemens are deploying and customizing the model to automate workflows in telecommunications, cybersecurity, semiconductor design, and manufacturing [1]. Life sciences organizations such as Edison Scientific and Lila Sciences will use it to power agents for deep literature search, data science, and molecular understanding [1]. Dell Technologies is bringing the model to the Dell Enterprise Hub on Hugging Face, optimized for on-premises deployment on the Dell AI Factory [1].

Open Weights Release with Extensive Training Data

NVIDIA is releasing Nemotron 3 Super with open weights under a permissive license, though the license includes safeguard clauses that distinguish it from pure open-source licenses like MIT or Apache 2.0 [2]. The company trained the model on synthetic data generated using frontier reasoning models and is publishing over 10 trillion tokens of pre- and post-training datasets, 15 reinforcement learning training environments, and evaluation recipes [1][4]. Researchers can use the NVIDIA NeMo platform to fine-tune the model or build their own [1]. The release positions NVIDIA as a leading Western contributor to open-source AI models, competing with Chinese AI labs while extending its dominance from infrastructure into the model layer [5].

TheOutpost.ai

© 2026 Triveous Technologies Private Limited