NVIDIA Cosmos 3 Brings Multimodal AI to Robots, Autonomous Vehicles and Industrial Systems

Reviewed by Nidhi Govil

6 Sources

NVIDIA launched Cosmos 3, an open world foundation model for physical AI that combines vision reasoning, multimodal generation and action prediction. Trained on 20 trillion tokens including nearly a billion images and 400 million videos, the model helps robots and autonomous vehicles understand causal relationships and predict outcomes before acting in real-world environments.

NVIDIA Introduces Cosmos 3 at GTC Taipei

NVIDIA unveiled Cosmos 3 at GTC Taipei during COMPUTEX, marking a significant expansion in the company's push beyond chips into physical AI systems [1][2]. The new open world foundation model addresses a fundamental challenge: robots and autonomous systems must operate in real-world environments, yet capturing and recreating real-world scenarios for training is slow, expensive, and often impossible to repeat at scale [1]. "The big bang of physical AI is just around the corner thanks to breakthroughs in multimodal reasoning language, vision and world models," said Jensen Huang, founder and CEO of NVIDIA [4].

Source: Geeky Gadgets

Foundation Model Trained on 20 Trillion Tokens

NVIDIA trained Cosmos 3 on 20 trillion tokens of multimodal data, including nearly a billion images, 400 million real and synthetic videos, ambient audio, text and action data from humans and robots [3]. This massive dataset gives developers a powerful pretrained foundation for building physical AI systems with less data and lower training costs [4]. The action data distinguishes Cosmos from regular video generators, as it's designed to model how machines move, not just how scenes look, according to Ming-Yu Liu, VP of NVIDIA's Cosmos Lab [3]. The multimodal AI model can generate rare or dangerous scenarios such as robot collisions or unusual road events that are difficult, expensive or unsafe to capture repeatedly [3].

Mixture-of-Transformers Architecture Powers Vision Reasoning

The foundation model is built on a breakthrough mixture-of-transformers architecture that pairs a reasoning transformer with an expert generation transformer [4]. This dual-tower design enables vision reasoning and multimodal generation across text, images, video, ambient sound and action in a single system [1]. The architecture allows Cosmos 3 to understand object interactions, motion and spatial-temporal relationships before generating video and action trajectories [4]. Developers can use the model as a vision language model, a world model for simulating physical environments, or as the backbone for world action models that help train robotics systems to perform specific tasks [4].

Source: NVIDIA

Action Prediction Enables Real-World Applications

Cosmos 3 is designed to generate action data such as robot joint angles, gripper positions and trajectories that can help train machines to navigate and manipulate the physical world [3]. In a warehouse, a robot may encounter object configurations it's never seen before, while on the road, an autonomous vehicle may need to respond when a pedestrian steps out from between parked cars [1]. The model delivers leading results on physical AI benchmarks, ranking first among open models across Artificial Analysis, Physics-IQ, PAI-Bench and R-Bench for world generation accuracy, RoboLab and RoboArena for action policy, and the VANTAGE-Bench and TAR leaderboards for vision understanding [4].
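To make the idea of action data more concrete, the sketch below shows one way a trajectory of joint angles and gripper positions could be represented and inspected. The `ActionStep` layout, its field names and units, and the `max_joint_step` helper are all hypothetical, chosen for illustration only; the article does not describe the format Cosmos 3 actually uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionStep:
    """One timestep of robot action data (hypothetical layout, not Cosmos 3's format)."""
    t: float                    # seconds since the start of the trajectory
    joint_angles: List[float]   # radians, one value per joint
    gripper_position: float     # assumed scale: 0.0 = fully closed, 1.0 = fully open


def max_joint_step(trajectory: List[ActionStep]) -> float:
    """Largest absolute change in any joint angle between consecutive steps."""
    largest = 0.0
    for prev, cur in zip(trajectory, trajectory[1:]):
        for a, b in zip(prev.joint_angles, cur.joint_angles):
            largest = max(largest, abs(b - a))
    return largest


# A made-up three-step trajectory for a two-joint arm closing its gripper.
trajectory = [
    ActionStep(t=0.0, joint_angles=[0.0, 0.5], gripper_position=1.0),
    ActionStep(t=0.1, joint_angles=[0.1, 0.5], gripper_position=0.5),
    ActionStep(t=0.2, joint_angles=[0.3, 0.25], gripper_position=0.0),
]

print(max_joint_step(trajectory))
```

A check like `max_joint_step` is the kind of sanity filter a developer might plausibly run on generated trajectories before using them for training, for example to reject motions that jump further between steps than a real actuator could.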

Scalable Versions for Different Use Cases

NVIDIA is releasing two versions immediately: Cosmos 3 Super, a 32-billion-parameter model for tasks requiring high physics accuracy such as training robots and autonomous vehicles, and Cosmos 3 Nano with 8 billion parameters per tower for faster inference that can generate results in fractions of a second [3][5]. An edge model that can run locally for real-time, on-device processing is coming soon [3][5]. The model reduces physical AI training and evaluation cycles from months to days [4].
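For a rough sense of what these parameter counts mean in practice, weight memory can be estimated as parameter count times bytes per parameter. The sketch below assumes 16-bit weights (2 bytes each), which the article does not state, and it covers weights only, not activations or other runtime memory. The `weight_memory_gb` helper is illustrative and not part of any NVIDIA tooling.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Parameter counts as reported in the article; 2 bytes per parameter is an assumption.
super_gb = weight_memory_gb(32e9)      # Cosmos 3 Super: 32 billion parameters
nano_tower_gb = weight_memory_gb(8e9)  # Cosmos 3 Nano: 8 billion parameters per tower

print(f"Super: ~{super_gb:.0f} GB, Nano: ~{nano_tower_gb:.0f} GB per tower")
```

Under that assumption the smaller model's weights take a quarter of the memory per tower, which is consistent with the article positioning Nano for faster inference.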

Cosmos Coalition and Isaac GR00T Reference Humanoid Robot

NVIDIA launched the Cosmos Coalition, a global collaboration between world model builders and AI developers including Agile Robots, Black Forest Labs, Generalist, LTX, Runway and Skild AI to advance next-generation world models in AI [4]. The company also introduced the Isaac GR00T Reference Humanoid Robot, an open reference design combining a Unitree H2 Plus humanoid robot, Sharpa dexterous hands, Jetson Thor onboard computing, and the Isaac GR00T software stack [2]. Research organizations including Ai2, ETH Zurich, Stanford Robotics Center, and UC San Diego plan to use the platform [2].

Expanding Into Semiconductor Manufacturing and Industrial Automation

NVIDIA is bringing AI for semiconductor manufacturing deeper into production through its collaboration with TSMC [2]. TSMC is using NVIDIA CUDA-X libraries and AI models for computational lithography, transistor simulation, process control, wafer inspection, and fab scheduling, achieving improvements in computational efficiency. It is also using NVIDIA Metropolis and TAO Toolkit to improve detection of nanometer-scale defects [2]. The announcements highlight NVIDIA's strategy to build a full-stack ecosystem for physical AI covering everything from synthetic data generation and simulation to real-world deployment in industrial automation [2]. Physical AI developers across industries are building on the Cosmos platform, including companies like Li Auto for autonomous vehicles and Samsung for robotics applications [4]. NVIDIA's bet is that the next wave of AI won't just answer questions or generate images but will need to predict, simulate and act in the physical world, with AI agents capable of understanding causal relationships and executing complex tasks [3].

© 2026 TheOutpost.AI All rights reserved