2 Sources
[1]
Nvidia releases Nemotron 3 Nano Omni: open multimodal model with 30B params, 3B active, for edge AI agents
Nvidia released Nemotron 3 Nano Omni on Tuesday, an open-weight multimodal AI model that unifies vision, audio, and language understanding in a single architecture designed to power autonomous AI agents on edge devices. The model has 30 billion parameters but activates only three billion per forward pass through a mixture-of-experts design, a ratio that allows it to run on a single GPU while matching or exceeding the multimodal capabilities of models several times its size. Nvidia claims nine times higher throughput than comparable open multimodal models with equivalent interactivity, 2.9 times faster single-stream reasoning on multimodal tasks, and roughly nine times greater effective system capacity for video reasoning. The model tops six benchmarks across document intelligence, video understanding, and audio comprehension. It processes text, images, audio, video, documents, charts, and graphical interfaces as inputs and produces text as output, meaning a single model can replace the patchwork of specialised vision, speech, and document-processing models that most enterprise AI deployments currently stitch together.

The release, available on Hugging Face under Nvidia's Open Model Agreement with full commercial use rights, represents the most aggressive move yet by the company that sells the infrastructure for AI into the market for the AI itself.

Nemotron 3 Nano Omni uses a hybrid Mamba-Transformer architecture with 23 Mamba-2 selective state-space layers, 23 mixture-of-experts layers with 128 experts routing to six per token plus a shared expert, and six grouped-query attention layers. The vision encoder, C-RADIOv4-H, handles variable-resolution images with 16-by-16 patches scaling from 1,024 to 13,312 visual patches per image. The audio encoder, Parakeet-TDT-0.6B-v2, processes speech and environmental audio. Video processing uses three-dimensional convolutions to capture motion between frames rather than treating video as a sequence of still images.
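The routing scheme described above, in which each token is sent to six of 128 experts selected by a router, can be sketched in a few lines. The toy below uses made-up dimensions and random router scores and is not Nvidia's implementation; it only illustrates the top-k-plus-shared-expert pattern.

```python
# Toy sketch of top-6-of-128 mixture-of-experts routing: the router
# scores all experts, the top 6 are activated (a shared expert would
# additionally run for every token), and their outputs are combined
# with softmax-normalised weights over the selected logits.
import math
import random

NUM_EXPERTS = 128
TOP_K = 6

def route_token(router_logits):
    """Pick the top-k experts for one token and softmax their logits."""
    top = sorted(range(NUM_EXPERTS), key=lambda i: router_logits[i],
                 reverse=True)[:TOP_K]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = route_token(logits)

# Only 6 of 128 experts run for this token; weights sum to 1.
assert len(routes) == TOP_K
assert abs(sum(w for _, w in routes) - 1.0) < 1e-9
```

Because only the selected experts execute, per-token compute scales with the six active experts rather than all 128, which is the mechanism behind the 30B-total, 3B-active figures quoted above.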
The base text model was pretrained on 25 trillion tokens and supports a 256,000-token context window. The architectural choices reflect a specific design philosophy: maximise capability per active parameter rather than total parameters, because edge deployment is constrained not by model size at rest but by compute per inference step. The three billion active parameters at inference mean the model can run on hardware announced at Nvidia's GTC 2026 developer conference, including the DGX Spark and DGX Station workstations, without requiring the multi-GPU clusters that power larger models in data centres.

The mixture-of-experts approach is not new, but its application to a multimodal model at this scale is. Most open multimodal models either use a single dense architecture, which requires all parameters to be active on every inference step, or use separate specialist models stitched together in a pipeline, which introduces latency at each handoff. Nemotron 3 Nano Omni does neither. It routes each token to six of 128 experts within a unified model, meaning vision tokens, audio tokens, and text tokens all flow through the same architecture but activate different expertise depending on the modality. The result is a model that can process a video feed, a spoken instruction, and a document simultaneously without the inter-model latency that makes pipeline architectures unsuitable for real-time agent applications.

For enterprise deployments, this collapses the operational complexity of maintaining separate vision, speech, and language models with separate inference endpoints, monitoring, and versioning into a single model serving a single endpoint. Nvidia has spent the AI boom selling infrastructure: GPUs, networking, and the CUDA software ecosystem that locks developers into its hardware.
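The "model size at rest versus compute per step" trade-off above can be made concrete with back-of-the-envelope arithmetic. The bytes-per-parameter and FLOPs-per-parameter figures below are common rules of thumb, not published specifications for this model.

```python
# Rough arithmetic for the 30B-total / 3B-active design point:
# memory must hold all weights, but per-token compute scales only
# with the active parameters.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9
BYTES_PER_PARAM = 2  # assumed 16-bit (FP16/BF16) weights

weight_memory_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9  # 60 GB at rest
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS           # 10% of weights per token
# Common rule of thumb: ~2 FLOPs per active parameter per token.
flops_per_token = 2 * ACTIVE_PARAMS                      # ~6 GFLOPs/token

print(weight_memory_gb, active_fraction, flops_per_token)
```

Under these assumptions a dense 30B model would need roughly ten times the per-token compute for the same weight footprint, which is the efficiency argument the article attributes to the design.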
The Nemotron model family, which has been downloaded more than 50 million times in the past year, represents a parallel strategy in which Nvidia also provides the models that run on that infrastructure. The logic is circular but powerful: Nvidia's models are optimised for Nvidia's hardware, and Nvidia's hardware is optimised for Nvidia's models, creating a full-stack ecosystem that competes with the model-plus-cloud offerings from Google, Amazon, and Microsoft.

The case for small, domain-specific language models has been made across education, healthcare, and enterprise, and Nemotron 3 Nano Omni extends that argument to multimodal applications: rather than calling a massive cloud model for every vision or audio task, enterprises can run a compact model locally that handles the full perceptual stack. Early enterprise adoption includes Foxconn, Palantir, Aible, ASI, Eka Care, and H Company, with Dell, DocuSign, Infosys, Oracle, and Zefr evaluating the model for production deployment. The use cases (factory-floor visual inspection, document processing, voice agent applications, and screen understanding for computer-use agents) reflect the market Nvidia is targeting: not consumer AI assistants but industrial AI agents that need to see, hear, and read in real time on local hardware.

The model is available as an Nvidia NIM microservice, through Amazon SageMaker JumpStart, and on OpenRouter, with deployment options including vLLM, SGLang, Ollama, llama.cpp, and TensorRT-LLM. The breadth of deployment options is itself a competitive statement: Nvidia is making the model runnable everywhere, on every framework, to maximise adoption and deepen the dependency on Nvidia's broader ecosystem.

Open-source AI models designed for agentic reasoning are arriving from multiple directions simultaneously. DeepSeek's V4-Pro and V4-Flash, released last week, use a hybrid attention architecture optimised for long-horizon agentic tasks.
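Several of the deployment paths named above (vLLM, SGLang, Ollama) expose an OpenAI-compatible chat endpoint, so a single request format covers them. A minimal sketch of assembling such a multimodal request follows; the model id and image URL are illustrative assumptions, and nothing is actually sent.

```python
# Sketch: building an OpenAI-style chat payload that mixes text and
# image inputs, as accepted by OpenAI-compatible servers such as the
# ones vLLM and SGLang provide. Model id and URL are placeholders.
import json

def build_chat_request(model, text, image_url):
    """Assemble an OpenAI-style chat payload with text and image parts."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

payload = build_chat_request(
    "nvidia/nemotron-3-nano-omni",    # assumed model id, not confirmed
    "Summarise the chart in this image.",
    "https://example.com/chart.png",  # placeholder image
)
print(json.dumps(payload, indent=2))
```

The same payload shape would be POSTed to whichever endpoint serves the model, which is what makes the multi-framework deployment story operationally uniform.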
Meta's Llama models dominate the open-weight text space. Google's Gemini models handle multimodal tasks at cloud scale. OpenAI's GPT models remain the commercial benchmark. What distinguishes Nemotron 3 Nano Omni is not any single capability but the combination: multimodal perception across vision, audio, and text in a single model, with mixture-of-experts efficiency that enables edge deployment, released as open weights with commercial licensing. No other model currently offers all four properties together. The closest comparators, Google's Gemini Nano for on-device and Meta's Llama for open weights, each lack at least one element: Gemini Nano is not open-weight, and Llama's multimodal capabilities do not include audio processing in a unified architecture.

The competitive implications extend beyond the model itself. If Nvidia's open models become the default for edge AI agent deployment, the company captures value at every layer of the stack: the GPU that runs inference, the software framework that optimises it, and now the model itself. Competitors who build on Nvidia's models deepen their dependency on Nvidia's hardware. Competitors who build their own models still need Nvidia's GPUs to train them. The agentic AI era is accelerating across the industry, and Nvidia's strategy is to be indispensable at every layer rather than dominant at one.

Nemotron 3 Nano Omni is not Nvidia's answer to GPT-4o. It is Nvidia's argument that the future of AI agents will be built on small, efficient, open models running on Nvidia hardware at the edge, rather than large, proprietary models running on someone else's cloud. Whether that argument holds depends on whether the enterprises building the next generation of autonomous systems prefer local control over cloud convenience, and whether a model with three billion active parameters can do the work that currently requires models with hundreds of billions. The benchmarks say it can.
The market will decide whether the benchmarks are right.
[2]
Nvidia introduces Nemotron 3 Nano Omni with vision and speech for powerful agentic AI use - SiliconANGLE
Nvidia Corp. today launched a powerful reasoning artificial intelligence model that unifies text, vision and speech, capable of acting as the "brains" of faster, smarter agentic AI applications. Dubbed Nemotron 3 Nano Omni, and weighing in at around 30 billion parameters, the new state-of-the-art model uses a mixture-of-experts architecture to deliver extremely low latency while providing high flexibility and control.

Nvidia combined vision and audio encoders with its 30B-AD3B hybrid MoE architecture to eliminate the need for separate perception modules, unifying everything into one model. The company said this allows the model to improve efficiency at scale and provide up to nine times faster throughput than other open omni models on the market.

"To build useful agents, you can't wait seconds for a model to interpret a screen," said Gautier Cloix, chief executive of H Company. "By building on Nemotron 3 Nano Omni, our agents can rapidly interpret full HD screen recordings, something that wasn't practical before."

The result is lower cost and higher scalability. With its smaller size, the model can also be compressed enough to run on higher-end consumer hardware and execute efficiently in enterprise cloud deployments. The company said it is designed to run alongside proprietary cloud models or other Nvidia Nemotron open models, such as Nemotron 3 Super for high-frequency execution or complex planning.

The new model allows for rapid understanding of documents, computer displays, voice activity, video and more. This makes it the perfect interface for working with people and bridging to more complex machine states: it can take conversational replies from a user and quickly turn them into reasoning. Nvidia said the Nemotron family (including Ultra, Super and Nano) has seen over 50 million downloads in the past year.
The Omni variant extends the family's capabilities into the multimodal and agentic domains.
Nvidia unveiled Nemotron 3 Nano Omni, an open multimodal model with 30 billion parameters that unifies vision, audio, and language understanding for autonomous AI agents on edge devices. Using mixture-of-experts architecture, it activates only 3 billion parameters per inference, delivering nine times higher throughput than comparable models while running on a single GPU.
Nvidia released Nemotron 3 Nano Omni on Tuesday, marking a significant shift in how the chip giant positions itself in the AI market [1]. The open multimodal model unifies vision and speech with language understanding in a single architecture, designed specifically to power autonomous AI agents on edge devices [2]. With 30 billion parameters but activating only three billion per forward pass through its mixture-of-experts architecture, the model runs on a single GPU while matching or exceeding capabilities of models several times its size [1].

The model delivers nine times higher throughput than comparable open multimodal models with equivalent interactivity, 2.9 times faster single-stream reasoning on multimodal tasks, and roughly nine times greater effective system capacity for video reasoning [1]. It tops six benchmarks across document intelligence, video understanding, and audio comprehension, processing text, images, audio, video, documents, charts, and graphical interfaces as inputs while producing text as output [1].
Source: SiliconANGLE
Nemotron 3 Nano Omni employs a hybrid Mamba-Transformer architecture with 23 Mamba-2 selective state-space layers, 23 mixture-of-experts layers with 128 experts routing to six per token plus a shared expert, and six grouped-query attention layers [1]. The vision encoder, C-RADIOv4-H, handles variable-resolution images with 16-by-16 patches scaling from 1,024 to 13,312 visual patches per image, while the audio encoder, Parakeet-TDT-0.6B-v2, processes speech and environmental audio [1]. Video processing uses three-dimensional convolutions to capture motion between frames rather than treating video as a sequence of still images [1].

The base text model was pretrained on 25 trillion tokens and supports a 256,000-token context window [1]. The architectural choices reflect a specific design philosophy: maximize capability per active parameter rather than total parameters, because edge deployment is constrained not by model size at rest but by compute per inference step [1].
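The patch-count range quoted for the vision encoder follows directly from the 16-by-16 patch size: an image of H by W pixels yields (H/16) times (W/16) visual patches. The resolutions below are illustrative choices, not documented inputs; they are picked because they land exactly on the quoted endpoints.

```python
# Patch-count arithmetic for a 16x16-pixel patch grid.
PATCH = 16

def patch_count(height, width):
    """Number of 16x16 patches covering a height-by-width image."""
    return (height // PATCH) * (width // PATCH)

# A 512x512 image gives the low end of the quoted range...
assert patch_count(512, 512) == 1024
# ...and e.g. a 1664x2048 image gives the high end.
assert patch_count(1664, 2048) == 13312
```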
The mixture-of-experts approach applied to a multimodal model at this scale represents a departure from traditional architectures [1]. Most open multimodal models either use a single dense architecture requiring all parameters to be active on every inference step, or use separate specialist models stitched together in a pipeline, which introduces latency at each handoff [1]. Nemotron 3 Nano Omni routes each token to six of 128 experts within a unified model, meaning vision tokens, audio tokens, and text tokens all flow through the same architecture but activate different expertise depending on the modality [1].

This design enables the model to process a video feed, a spoken instruction, and a document simultaneously without the inter-model latency that makes pipeline architectures unsuitable for real-time agent applications [1]. "To build useful agents, you can't wait seconds for a model to interpret a screen," said Gautier Cloix, chief executive of H Company. "By building on Nemotron 3 Nano Omni, our agents can rapidly interpret full HD screen recordings, something that wasn't practical before" [2].
For enterprise AI deployments, the model collapses the operational complexity of maintaining separate vision, speech, and language models with separate inference endpoints, monitoring, and versioning into a single model serving a single endpoint [1]. With its smaller size, it can be compressed enough to run on higher-end consumer hardware and execute efficiently on enterprise cloud deployments [2]. Early enterprise adoption includes Foxconn, Palantir, Aible, ASI, Eka Care, and H Company, with Dell, DocuSign, Infosys, Oracle, and Zefr evaluating the model for production deployment [1]. Use cases span factory-floor visual inspection, document processing, voice agent applications, and screen understanding for computer-use agents [1].

The release, available on Hugging Face under Nvidia's Open Model Agreement with full commercial use rights, represents the most aggressive move yet by the company that sells the infrastructure for AI into the market for the AI itself [1]. Nvidia has spent the AI boom selling infrastructure: GPUs, networking, and the CUDA software ecosystem that locks developers into its hardware [1]. The Nemotron model family, which has seen over 50 million downloads in the past year, represents a parallel strategy in which Nvidia also provides the models that run on that infrastructure [1][2].

Nvidia's models are optimized for Nvidia's hardware, and Nvidia's hardware is optimized for Nvidia's models, creating a full-stack ecosystem that competes with the model-plus-cloud offerings from Google, Amazon, and Microsoft [1]. The model is designed to run alongside other proprietary cloud models or other Nvidia Nemotron open models, such as Nemotron 3 Super for high-frequency execution or complex planning [2]. The case for small, domain-specific language models extends to multimodal applications: rather than calling a massive cloud model for every vision or audio task, enterprises can run a compact model locally that handles the full perceptual stack [1].

Summarized by Navi