6 Sources
[1]
How Cosmos 3 Helps Physical AI Think Before It Acts
The new, open NVIDIA world foundation model brings vision reasoning, multimodal generation and action prediction together to help robots, autonomous vehicles and vision AI agents think before acting in the real world. The real world is always in motion. To operate autonomously, physical AI systems -- including robots, autonomous vehicles (AVs) and smart spaces -- need to understand not just what they see and what caused that to happen, but what's likely to happen next. In a warehouse, a robot may encounter object configurations it's never seen before. On the road, an AV may need to respond when a pedestrian steps out from between parked cars. And in a factory, a safety system must predict where a forklift is heading, not just detect that it's there. Capturing and recreating those scenarios in the real world is slow, expensive and often impossible to repeat at scale. NVIDIA Cosmos 3 is built for that loop. The new world foundation model -- announced today at NVIDIA GTC Taipei at COMPUTEX -- combines vision reasoning and multimodal generation across text, video, images, ambient sound and action in a single model to help developers create world data with physical context.
[2]
NVIDIA launches Cosmos 3, chip-fab tools and humanoid robot platform
NVIDIA unveiled a broad set of technologies aimed at accelerating the development of physical AI systems, expanding its push into humanoid robots, autonomous vehicles, semiconductor manufacturing, and industrial automation. Announced at GTC Taipei, the company's latest releases include Cosmos 3, an open foundation model for physical AI, a new reference humanoid robot built on its Isaac GR00T platform, open-source agent tools for robotics and industrial AI, and new AI-powered semiconductor manufacturing initiatives with TSMC. Together, the announcements highlight NVIDIA's strategy to build a full-stack ecosystem for physical AI, covering everything from synthetic data generation and simulation to real-world deployment. "The big bang of physical AI is just around the corner thanks to breakthroughs in multimodal reasoning language, vision and world models," said Jensen Huang, founder and CEO of NVIDIA. He added: "The Cosmos 3 family of open, frontier omnimodels gives developers a generational leap in ability to build robots, autonomous vehicles and vision AI that perceive, reason, plan and act in the physical world." At the center of the announcement is Cosmos 3, which NVIDIA describes as the world's first fully open omnimodel capable of understanding and generating text, images, video, ambient sound, and actions within a single system. The model is built on a mixture-of-transformers architecture that combines reasoning and content generation. NVIDIA says Cosmos 3 can serve as a vision-language model, a world model for simulating physical environments, or as the foundation for robot action models. The company claims the model tops several open-model benchmarks for world generation, robotic action policies, and vision understanding. Cosmos 3 is available in multiple versions, including Cosmos 3 Super for high-accuracy robotics and autonomous vehicle applications and Cosmos 3 Nano for faster inference. NVIDIA also launched a collection of open-source physical AI skills and tools that allow AI agents to execute tasks across robotics, vision AI, autonomous driving, healthcare, and industrial digital twins. The tools convert complex development workflows into repeatable, agent-executable processes that can automate data generation, simulation, training, and deployment. For robotics researchers, NVIDIA introduced the Isaac GR00T Reference Humanoid Robot, an open reference design combining a Unitree H2 Plus humanoid robot, Sharpa dexterous hands, Jetson Thor onboard computing, and the Isaac GR00T software stack. The platform is intended to simplify humanoid robot development by integrating hardware, simulation, training, and deployment into a single system. Research organizations including Ai2, ETH Zurich, Stanford Robotics Center, and UC San Diego plan to use the platform. NVIDIA is also bringing AI deeper into semiconductor manufacturing through its collaboration with TSMC. The chipmaker is using NVIDIA CUDA-X libraries and AI models for computational lithography, transistor simulation, process control, wafer inspection, and fab scheduling. According to NVIDIA, TSMC has achieved improvements in computational efficiency while also using NVIDIA Metropolis and TAO Toolkit to improve detection of nanometer-scale defects. "NVIDIA and TSMC have worked together for nearly three decades to push the limits of computing," Huang said. "TSMC is bringing NVIDIA AI and accelerated computing into the fab itself, tackling some of the world's most complex design and manufacturing challenges with simulation, optimization and AI." NVIDIA is also exploring physical AI for autonomous vehicles through Alpamayo 2 Super, a 32-billion-parameter reasoning model designed to help robotaxis understand, plan, and respond to complex road situations.
[3]
Nvidia's new world model helps robots navigate the world
Why it matters: Nvidia is continuing its move beyond chips into AI models and software, positioning itself to become a foundational platform for physical AI development. Driving the news: Nvidia says it trained Cosmos 3 on 20 trillion tokens of multimodal data, including nearly a billion images, 400 million real and synthetic videos, ambient audio, text and action data from humans and robots. * That action data is what makes Cosmos different from a regular video generator. It's meant to model how machines move, not just how scenes look, Ming-Yu Liu, VP of Nvidia's Cosmos Lab, told Axios. Autonomous actions are key. * Developers can use Cosmos 3 to simulate actions in physical environments, then build task-specific models for robots and other machines on top of it. * Cosmos 3 is designed to generate action data -- such as robot joint angles, gripper positions and trajectories -- that can help train machines to navigate and manipulate the physical world. Between the lines: Cosmos is an open model, similar to its early Nemotron family, making it easier for hardware makers to customize Cosmos to their needs and ensure that future versions more closely align to the needs of the industry, Liu said. * Nvidia is also establishing a coalition of companies supporting the effort. Initial partners include Agile Robots, Black Forest Labs and Runway. * Nvidia says Cosmos can generate rare or dangerous scenarios -- such as robot collisions or unusual road events -- that are difficult, expensive or unsafe to capture repeatedly. Zoom in: Nvidia is releasing two versions immediately: a "super" model for tasks requiring high physics accuracy, such as training robots and autonomous vehicles, and a "nano" model that can generate results in fractions of a second. * An "edge" model that can run locally is coming soon, Nvidia said. Zoom out: World models have become a key growth area for AI as companies increasingly want to take the smarts of chatbots and agents and allow them to perform real-world tasks. * Among the hot startups in this area are Fei-Fei Li's World Labs and Yann LeCun's AMI Labs. * "Ultimately what a world model wants to achieve is to help physical agents to become more generalizable," Liu said. "To become more generalizable, you need to understand the world so you understand how it works, so you can make a plan." Bottom line: Nvidia's bet is that the next wave of AI won't just answer questions or generate images -- it will need to predict, simulate and act in the physical world, and Nvidia wants its open models and infrastructure to be the place developers start.
[4]
NVIDIA Launches Cosmos 3, the Open Frontier Foundation Model for Physical AI
* NVIDIA Cosmos 3 is a new leaderboard-topping open physical AI foundation model, built on a breakthrough mixture-of-transformers architecture for physical AI reasoning, world simulation and action generation. * Cosmos 3 is the world's first fully open omnimodel with native vision reasoning and multimodal generation across text, image, video, ambient sound and action for state-of-the-art synthetic data generation and physical AI policy model development. * NVIDIA launches the NVIDIA Cosmos Coalition with leading AI labs and robotics leaders -- including Agile Robots, Black Forest Labs, Generalist, LTX, Runway and Skild AI -- to advance the next generation of open world models. NVIDIA GTC Taipei -- NVIDIA today launched NVIDIA Cosmosâ„¢ 3, an open world foundation model for physical AI built on a breakthrough mixture-of-transformers architecture that combines vision reasoning, world generation and action prediction in a single system. Cosmos 3 is the world's first fully open omnimodel that can natively understand and generate text, images, video, ambient sound and actions with leading physics accuracy, reducing physical AI training and evaluation cycles from months to days. NVIDIA also launched the NVIDIA Cosmos Coalition, a global collaboration between world model builders and AI developers -- including Agile Robots, Black Forest Labs, Generalist, LTX, Runway and Skild AI -- working together to advance next-generation world models. "The big bang of physical AI is just around the corner thanks to breakthroughs in multimodal reasoning language, vision and world models," said Jensen Huang, founder and CEO of NVIDIA. "The Cosmos 3 family of open, frontier omnimodels gives developers a generational leap in ability to build robots, autonomous vehicles and vision AI that perceive, reason, plan and act in the physical world." A New Architecture for Physical AI Cosmos 3 tackles a fundamental challenge in physical AI: enabling robots, autonomous vehicles (AVs) or vision agents to generalize in the real world with limited training data and fragmented simulation stacks. The model's mixture-of-transformers architecture pairs a reasoning transformer with an expert generation transformer, enabling Cosmos 3 to understand object interactions, motion and spatial-temporal relationships before generating video and action trajectories. Trained on one of the largest multimodal physical AI datasets -- including billions of samples across text, image, video, sound and action trajectories -- the model gives developers a powerful pretrained foundation for building physical AI systems with less data and lower training costs. Developers can use Cosmos 3 as: * A vision language model that understands and reasons across modalities. * A world model or video foundation model that simulates physical environments and predicts future world states for training and evaluation. * The backbone for world action models that help train robots to perform specific tasks. Cosmos 3 models deliver leading results on physical AI benchmarks. Among open models, it ranks first across Artificial Analysis, Physics-IQ, PAI-Bench and R-Bench for world generation accuracy, RoboLab and RoboArena for action policy, and the VANTAGE-Bench and TAR leaderboards for vision understanding. The Cosmos 3 lineup gives developers options for different stages of physical AI development: * Cosmos 3 Super for post-training robotics and AV models that need the highest physics accuracy and generation quality. * Cosmos 3 Nano for high-quality video and action reasoning in fractions of a second. * Cosmos 3 Edge, coming soon, for real-time inference at the edge. Cosmos Coalition Accelerates Open World Model Development The Cosmos Coalition is a global collaboration between world model builders, AI developers and physical AI leaders to advance open world models across industries, enabling members to contribute models, research and evaluation techniques while using Cosmos 3 technologies, training tools and NVIDIA DGXâ„¢ Cloud infrastructure for large-scale training. Founding coalition members include Agile Robots, Black Forest Labs, Generalist, LTX, Runway and Skild AI. By building in the open and contributing across a shared ecosystem, the coalition aims to enable faster innovation, broader interoperability and more rapid advances in physical AI. Developers Build on Cosmos The Cosmos platform powers NVIDIA's physical AI stack to accelerate training and evaluation workflows across industries. The platform now includes new datasets for robotics, physics, human motion, autonomous driving, warehouse safety and spatial reasoning, as well as new physical AI agent skills for neural scene reconstruction, defect-image generation and video augmentation. Physical AI developers are building on the Cosmos platform across industries -- Agile Robots, Doosan Robotics, LG Electronics, Samsung and Skild AI for robotics, Li Auto for AVs, and Centific, Fogsphere, Linker Vision, Milestone Systems and Yuan for vision AI agents to power industrial AI and smart spaces applications. Availability Cosmos 3 Super and Cosmos 3 Nano are available now, with Cosmos 3 Edge coming soon for real-time inference. Developers can try Cosmos 3 on build.nvidia.com, download open models from Hugging Face, customize models and generate synthetic data with Hugging Face Diffusers and resources on GitHub, and deploy the models as NVIDIA NIMâ„¢ microservices. Model builders and software providers can accelerate access, customization and deployment of Cosmos for key reasoning and synthetic data generation workloads using physical AI agent skills on GitHub through inference services and cloud infrastructure partners including Baseten, CoreWeave, Microsoft Azure, Nebius, Deep Infra and Classmethod. Watch the keynote from Huang, learn more at NVIDIA GTC Taipei and explore these physical AI sessions.
[5]
Why NVIDIA's Cosmos 3 is a Massive Leap for Multimodal AI
NVIDIA's Cosmos 3, introduced at GTC Taipei, represents a significant leap in multimodal AI by unifying five distinct data types, text, images, videos, audio and actions, into a single framework. This integration eliminates the need for separate models, streamlining complex tasks like text-to-video generation or predictive modeling. Sam Witteveen highlights how the model's dual-tower transformer architecture, featuring an Autoregressive Reasoner and a Diffusion-Based Generation Tower, ensures both precise data interpretation and high-quality output creation. These innovations make Cosmos 3 particularly suited for applications in robotics, synthetic data generation and immersive media. Explore how Cosmos 3's scalable configurations, such as the high-performance Super version or the compact Nano variant, cater to diverse computational needs. Gain insight into its role in advancing fields like physical AI and predictive modeling, where multimodal integration enhances decision-making and task execution. Whether you're interested in AI-driven content creation or deploying edge solutions, this overview offers a comprehensive breakdown of the model's potential across industries. What Sets Cosmos 3 Apart? Cosmos 3's defining strength lies in its ability to seamlessly process and generate outputs across multiple modalities. Whether tasked with analyzing text, interpreting images, generating videos, processing audio, or predicting actions, this model performs all these functions within a unified framework. Unlike traditional systems that rely on separate, interconnected models, Cosmos 3 ensures consistency, efficiency and accuracy in multimodal tasks. For example, it can transform a textual description into a detailed video or image, making it a versatile tool for creative industries and technical applications alike. This capability is particularly valuable for industries that require complex data synthesis and interpretation, such as entertainment, education and advanced research. By integrating diverse data types into a cohesive system, Cosmos 3 redefines the potential of multimodal AI. Architectural Innovations At the heart of Cosmos 3 is its dual-tower transformer architecture, carefully designed to optimize both input processing and output generation. This architecture comprises two specialized components: * Autoregressive Reasoner: Responsible for processing and interpreting multimodal inputs, making sure accurate understanding of diverse data types. * Diffusion-Based Generation Tower: Focused on generating high-quality outputs, such as synthetic images, videos, or audio, with exceptional precision and detail. These two towers are interconnected through a shared multimodal attention mechanism, which ensures coherence and consistency across different data types. This streamlined design not only enhances performance but also simplifies the deployment of complex AI systems. By integrating these components into a unified framework, Cosmos 3 makes it easier to implement advanced AI solutions across a wide range of industries. Discover other guides from our vast content that could be of interest on NVIDIA. Scalable Configurations for Diverse Needs To meet the varying demands of different applications, Cosmos 3 is available in multiple scalable configurations: * Cosmos 3 Super: Featuring 32 billion parameters per tower, this version is tailored for high-performance, resource-intensive applications such as advanced robotics and large-scale simulations. * Cosmos 3 Nano: A compact version with 8 billion parameters per tower, offering a total of 16 billion parameters. This configuration is ideal for tasks requiring efficiency and scalability without compromising functionality. * Edge Version (Upcoming): Optimized for real-time, on-device processing, this version is designed for edge computing scenarios, allowing AI capabilities in environments with limited connectivity or computational resources. These variants provide flexibility, allowing organizations to choose a model that aligns with their specific computational requirements and operational goals. Whether you're working on large-scale projects or deploying AI at the edge, Cosmos 3 offers a tailored solution. Applications Across Industries Cosmos 3's multimodal capabilities unlock a wide array of applications across various industries, demonstrating its versatility and fantastic potential: * Synthetic Data Generation: Enables the creation of training datasets for robotics and physical AI systems, significantly reducing the need for extensive real-world data collection. * Predictive Modeling: Supports forward dynamics prediction and action modeling, which are critical for robotics, automation and simulation tasks. * Text-to-Video and Text-to-Image Transformation: Converts textual inputs into rich visual or video outputs, streamlining content creation, simulation and educational processes. * Advanced Robotics: Enhances robotic systems by integrating multimodal data for improved decision-making and task execution. * Entertainment and Media: Facilitates the development of immersive experiences, such as AI-generated films, interactive media and personalized content. These use cases highlight the model's potential to drive innovation in fields ranging from entertainment and education to robotics and advanced AI research. By allowing seamless integration of diverse data types, Cosmos 3 opens new possibilities for creative and technical applications. Technical Foundations and Advancements Cosmos 3 builds on a foundation of advanced pre-trained models, such as Kwenta 3VL and Variational Autoencoders (VAEs), to deliver robust functionality. Its pre-training on diverse datasets ensures strong generalization capabilities, while supervised fine-tuning tailors the model for specific tasks and industries. The diffusion-based generation mechanism further enhances the quality of outputs, particularly in image and video synthesis. This approach ensures that Cosmos 3 maintains high accuracy and adaptability across a wide range of applications. By combining innovative techniques with a scalable architecture, Cosmos 3 sets a new standard for multimodal AI systems. The Significance of Cosmos 3 Cosmos 3 represents a pivotal step forward in bridging the gap between digital intelligence and real-world applications. By allowing seamless multimodal integration, it accelerates advancements in physical AI, robotics and other innovative fields. Its scalable architecture and comprehensive capabilities also contribute to progress toward Artificial General Intelligence (AGI), bringing us closer to AI systems that can perform a wide range of tasks with human-like versatility. Whether you're developing creative AI applications, advancing robotics, or exploring new frontiers in technology, Cosmos 3 provides a powerful foundation for innovation. Its ability to unify diverse data types into a cohesive system sets a new benchmark for AI development, paving the way for future breakthroughs in intelligence, automation and beyond. Media Credit: Sam Witteveen Disclosure: Some of our articles include affiliate links. If you buy something through one of these links, Geeky Gadgets may earn an affiliate commission. Learn about our Disclosure Policy.
[6]
NVIDIA Calls Cosmos 3 The World's First Fully Open Omnimodel, As Robots And Autonomous Vehicles Get A Powerful Brain Grounded In Physics
NVIDIA has just announced its Cosmos 3 world model at the ongoing GTC Taipei, giving us a glimpse at what it calls the world's first "fully open omnimodel" that is capable of vision-based reasoning, while supporting multimodal output in the form of text, image, video, and ambient sound. NVIDIA's Cosmos 3 "pairs a reasoning transformer with an expert generation transformer," allowing the model to grasp physical interactions before generating video and action content that leverages those interactions At its heart, the Cosmos 3 tackles the challenge of making robots, autonomous vehicles (AVs), and vision agents understand their surroundings in an environment where training data is limited and simulation stacks remain fragmented. NVIDIA's Cosmos 3 is an open omnimodel, which means it is able to "natively understand and generate text, images, video, ambient sound and actions with leading physics accuracy." Its unique strength lies in it's architecture, which pairs reasoning transformers with those geared towards generation, "enabling Cosmos 3 to understand object interactions, motion and spatial-temporal relationships before generating video and action trajectories." For the benefit of those who might not be aware, an AI transformer is basically a deep learning neural network that tracks relationships and context within sequential data, which might include words in a sentence. These networks can substantially speed up output generation by undertaking parallel processing, where a given data sequence is analyzed simultaneously instead of piece-by-piece. Coming back, according to NVIDIA, you can use the Cosmos 3 as a: Finally, do note that Cosmos 3 Super, which has the highest-fidelity responses, and Cosmos 3 Nano are available right now, with Cosmos 3 Edge coming soon for real-time inference, that too, geared towards edge devices. Follow Wccftech on Google to get more of our news coverage in your feeds.
Share
Copy Link
NVIDIA launched Cosmos 3, an open world foundation model for physical AI that combines vision reasoning, multimodal generation and action prediction. Trained on 20 trillion tokens including nearly a billion images and 400 million videos, the model helps robots and autonomous vehicles understand causal relationships and predict outcomes before acting in real-world environments.
NVIDIA unveiled NVIDIA Cosmos 3 at GTC Taipei during COMPUTEX, marking a significant expansion in the company's push beyond chips into physical AI systems
1
2
. The new open world foundation model addresses a fundamental challenge: enabling robots and autonomous systems to operate in real-world environments where capturing and recreating scenarios is slow, expensive, and often impossible to repeat at scale1
. "The big bang of physical AI is just around the corner thanks to breakthroughs in multimodal reasoning language, vision and world models," said Jensen Huang, founder and CEO of NVIDIA4
.
Source: Geeky Gadgets
NVIDIA trained Cosmos 3 on 20 trillion tokens of multimodal data, including nearly a billion images, 400 million real and synthetic videos, ambient audio, text and action data from humans and robots
3
. This massive dataset gives developers a powerful pretrained foundation for building physical AI systems with less data and lower training costs4
. The action data distinguishes Cosmos from regular video generators, as it's designed to model how machines move, not just how scenes look, according to Ming-Yu Liu, VP of NVIDIA's Cosmos Lab3
. The multimodal AI model can generate rare or dangerous scenarios such as robot collisions or unusual road events that are difficult, expensive or unsafe to capture repeatedly3
.The foundation model is built on a breakthrough mixture-of-transformers architecture that pairs a reasoning transformer with an expert generation transformer
4
. This dual-tower design enables vision reasoning and multimodal generation across text, images, video, ambient sound and action in a single system1
. The architecture allows Cosmos 3 to understand object interactions, motion and spatial-temporal relationships before generating video and action trajectories4
. Developers can use the model as a vision language model, a world model for simulating physical environments, or as the backbone for world action models that help train robotics systems to perform specific tasks4
.
Source: NVIDIA
Cosmos 3 is designed to generate action data such as robot joint angles, gripper positions and trajectories that can help train machines to navigate and manipulate the physical world
3
. In a warehouse, a robot may encounter object configurations it's never seen before, while on the road, an autonomous vehicle may need to respond when a pedestrian steps out from between parked cars1
. The model delivers leading results on physical AI benchmarks, ranking first among open models across Artificial Analysis, Physics-IQ, PAI-Bench and R-Bench for world generation accuracy, RoboLab and RoboArena for action policy, and the VANTAGE-Bench and TAR leaderboards for vision understanding4
.NVIDIA is releasing two versions immediately: Cosmos 3 Super, a 32-billion-parameter model for tasks requiring high physics accuracy such as training robots and autonomous vehicles, and Cosmos 3 Nano with 8 billion parameters per tower for faster inference that can generate results in fractions of a second
3
5
. An edge model that can run locally for real-time, on-device processing is coming soon3
5
. The model reduces physical AI training and evaluation cycles from months to days4
.Related Stories
NVIDIA launched the Cosmos Coalition, a global collaboration between world model builders and AI developers including Agile Robots, Black Forest Labs, Generalist, LTX, Runway and Skild AI to advance next-generation world models in AI
4
. The company also introduced the Isaac GR00T Reference Humanoid Robot, an open reference design combining a Unitree H2 Plus humanoid robot, Sharpa dexterous hands, Jetson Thor onboard computing, and the Isaac GR00T software stack2
. Research organizations including Ai2, ETH Zurich, Stanford Robotics Center, and UC San Diego plan to use the platform2
.NVIDIA is bringing AI for semiconductor manufacturing deeper into production through its collaboration with TSMC
2
. TSMC is using NVIDIA CUDA-X libraries and AI models for computational lithography, transistor simulation, process control, wafer inspection, and fab scheduling, achieving improvements in computational efficiency while using NVIDIA Metropolis and TAO Toolkit to improve detection of nanometer-scale defects2
. The announcements highlight NVIDIA's strategy to build a full-stack ecosystem for physical AI covering everything from synthetic data generation and simulation to real-world deployment in industrial automation2
. Physical AI developers across industries are building on the Cosmos platform, including companies like Li Auto for autonomous vehicles and Samsung for robotics applications4
. NVIDIA's bet is that the next wave of AI won't just answer questions or generate images but will need to predict, simulate and act in the physical world, with AI agents capable of understanding causal relationships and executing complex tasks3
.Summarized by
Navi
[2]
[5]
07 Jan 2025•Technology

12 Aug 2025•Technology

07 Jan 2025•Technology

1
Startups

2
Policy and Regulation

3
Policy and Regulation
