4 Sources
[1]
Alibaba unveils AI models for robots as China's focus shifts to agents
The company wants to be China's "AI factory," spanning chips, models, and the agents built on top of them. Alibaba has revealed its first suite of AI models for robots, a move that says as much about where Chinese technology is heading as about the models themselves. The launch came as the industry pivots away from chatbots and towards agents, the systems meant to carry out complex tasks rather than just answer questions. At the centre is RynnBrain, a system built to help machines understand space, objects, and motion, the perceptual groundwork a robot needs before it can act in the physical world. In a demonstration released by Alibaba's DAMO Academy research arm, a robot identifies a piece of fruit and places it in a basket, a small task that stands in for a large ambition. Alongside it, Alibaba announced Qwen3.7-Max, the latest in its proprietary large-language-model line, pitched as a foundation for AI agents. The company said the model could run autonomously for up to 35 hours without performance degrading, a claim aimed at the durability that agentic work demands, since an agent that drifts after a few hours is of little use for tasks that take days. The figure is the company's own. The framing Alibaba chose for itself is "AI factory." It described itself as the only company in China operating all five layers of what it calls the full AI stack, from chips through an agentic cloud, models, model-serving platforms, and the applications on top. The pitch is vertical integration as a competitive moat: own every layer, and the gains at one compound through the rest. It is also the language of physical AI, the convergence of models and machines that rivals from Google to Siemens have been pursuing on the factory floor. The shift from chatbots to agents is the strategic backdrop. 
Chinese firms, like their American counterparts, have concluded that the more lucrative business is not the conversational model but the system that can take actions, book, buy, operate, schedule, on a user's behalf. Robotics is the most physical expression of that bet, extending the agent from the screen into the warehouse and the home, the same territory an Nvidia-powered humanoid robot has already begun testing in live logistics work. The launch also has a competitive edge to it. Alibaba is racing the other Chinese technology giants, and the American labs, to define what the agent era looks like, and robotics is a field where Chinese manufacturers already hold real advantages in hardware and supply chain. Pairing a domestic model stack with that manufacturing base is the kind of vertical play that is harder for a software-only rival to match, and it fits a national strategy that treats both AI and robotics as priorities. Whether the demonstrations translate into deployed products is the open question, and it usually is with robotics, where the gap between a controlled demo and a reliable machine has humbled many. Alibaba has not detailed pricing, availability, or which customers will get the robot models first. What it has set out is a position: a claim to span the whole stack at the moment the industry decides agents, not chatbots, are the prize.
[2]
Alibaba Is Building Qwen-Robot: The Operating System for the Robot Economy
The company says its models top multiple robotics benchmarks, using millions of training samples and tens of thousands of hours of open-source robot data. Alibaba's Qwen team dropped the Qwen-Robot Suite on Tuesday: three foundation models forming what they call a "full stack for embodied intelligence." Qwen-RobotNav handles mobility. Qwen-RobotManip handles manipulation. Qwen-RobotWorld simulates the physics that make both possible. Each works independently. Together, they're the Android moment for robotics -- the operating system, not the hardware. Alibaba is right now the only company in China spanning chips, cloud, models, serving platforms, and applications. For the company, robotics is the most physical expression of that bet, what is known as embodied AI. AI agents currently rely on LLMs to power their decisions. The usual way robots work is by machine-learning models which, although advanced, lack the adaptability of generative AI. Physical agents face a different, harder class of failure modes: physics, not prompts. For these use cases, Alibaba introduced this new AI suite with different components: Qwen-RobotNav unifies five navigation tasks -- instruction following, point-goal navigation, object search, target tracking, and autonomous driving -- each demanding different visual memory strategies. Most models hardcode one strategy. Qwen-RobotNav exposes a parameterized interface: token budget, temporal decay, per-camera weights that a planner can reconfigure mid-episode. Trained on 15.6 million samples with randomization across all parameters, it achieves 76.5% success on VLN-CE RxR, a benchmark for vision-and-language navigation in real-world environments, and 90% tracking on EVT-Bench, which evaluates an agent's ability to consistently follow moving targets. Qwen-RobotManip tackles one of the biggest challenges in robotic manipulation: different robots represent actions in fundamentally different ways. 
A Franka arm (a type of robot with seven axes of movement) operates through joint angles, while an ALOHA robot (a low-cost bimanual robot platform widely used in robotics research) represents actions through the position and orientation of its grippers (end-effector poses). Humanoids add another layer of complexity, using whole-body coordinates. To bridge these incompatible action spaces, Alibaba synthesized approximately 38,100 hours of training data from open-source robot datasets and human videos -- without relying on proprietary data collection. The model ranks first on RoboChallenge Table30-v1, outperforming previous approaches by 20%. Qwen-RobotWorld is the most ambitious: a language-conditioned video world model treating natural language as a universal action interface. "Pick up the red cup and pour water on the flower" works whether the actor is a gripper, an autonomous vehicle, or a mobile navigation agent. The Embodied World Knowledge corpus spans 8.6 million video-text pairs -- 200 million frames -- across manipulation (5.9 million samples, 1,300+ skills, 20+ morphologies), autonomous driving (Waymo, NVIDIA PhysicalAI-AD, Bench2Drive), indoor navigation (VLNVerse), and human-to-robot transfer across 14 robot arms. It ranks first on EWMBench and DreamGen Bench, two benchmarks that evaluate whether world models predict and generate realistic physical environments. It also beats all open-source models on WorldModelBench and PBench, and scores perfectly on physics adherence: Newton's laws, mass conservation, fluid dynamics, gravity. The ChatGPT of robots? While Western labs (Google DeepMind, Nvidia, Figure, Physical Intelligence) pursue similar goals, most focus on navigation or manipulation, not a unified, composable suite. Alibaba's vertical integration from chips through applications means they control the full stack. The open-source foundation differentiates against competitors relying on private robot data. 
There are some misconceptions worth clearing up: These are not robots but software models -- brains, not bodies. They run on hardware from AgileX, Franka, Universal Robots, Unitree, and others. Also, despite these being generative AI models for robots, these aren't LLMs like your typical ChatGPT. A language model predicts tokens. These models must understand physics, spatial relationships, and consequences of physical actions. A language model tells you a glass breaks if dropped. Qwen-RobotWorld predicts how it breaks -- shatter pattern, fluid dynamics, secondary collisions. Qwen-RobotManip plans a grasp that prevents the drop entirely. Don't expect to have your own housemaid robot anytime soon. The gap between a controlled demo of a robot placing fruit in a basket and a robot reliably working in your home is enormous. RoboCasa365, LIBERO-Plus, RoboTwin-Clean2Rand -- these are simulation benchmarks. Real-world deployment introduces sensor noise, actuator drift, and the long tail of edge cases that have humbled every robotics effort in history, and Alibaba recognizes this. The technical achievements are real, though. RobotManip's alignment-first approach solves a genuine bottleneck in cross-embodiment training. RobotNav's parameterized observation interface is a clever solution to the context-strategy problem. RobotWorld's language-as-universal-action-interface is the right abstraction for cross-domain world modeling. Alibaba hasn't disclosed pricing, timelines, or which customers get access beyond pilot programs.
[3]
Alibaba Debuts Suite of AI Models for Robots | PYMNTS.com
The Qwen Robot Suite comes as AI companies shift away from chatbots and into physical AI. "The Qwen family of foundation models already gives strong perception and reasoning about the physical world," the post said. "But seeing is not acting. The gap between vision and language understanding and physical control remains the central bottleneck for embodied intelligence." The Qwen Robot Suite's three models close this gap, per the post. Qwen-RobotNav helps robots understand how to navigate physical spaces, while Qwen-RobotWorld is a video "world model" that allows a robot to predict how a physical scenario will unfold. Lastly, Qwen-RobotManip "turns heterogeneous robot data into a coherent canonical space, enabling cross-embodiment training at scale," the post said. "Together, they enable an agentic system where general intelligence translates directly into physical action," the post said. The launch comes one week after a report that Alibaba Group formed a new business unit known as Token Foundry as it reorganizes to strengthen its AI efforts. Led by Alibaba CEO Eddie Wu, Token Foundry will combine Alibaba's Tongyi Lab and Future Life business units, and operate under the company's recently created Alibaba Token Hub. In other physical AI news, last week saw the launch of Nvidia's Cosmos 3 foundational model for physical AI. Nvidia Founder and CEO Jensen Huang said at the launch that "the big bang of physical AI is just around the corner thanks to breakthroughs in multimodal reasoning language, vision and world models." The distinction is important for anyone building or deploying physical AI. While a large language model learns from text, a world foundation model learns from physical environments. For robots, that means learning to handle objects with the help of millions of interaction examples, while an autonomous vehicle needs exposure to rare and dangerous scenarios that can't be safely or cheaply collected at scale on public streets. 
"World foundation models solve this by generating synthetic training data that reflects real physics," PYMNTS reported at the time. "Instead of driving a test fleet for years, an autonomous vehicle developer can run millions of simulated scenarios in days."
[4]
Alibaba Launches Robotics AI Models as It Ramps Up Physical AI Push
Alibaba Group has rolled out a suite of artificial-intelligence models that can help robots better understand and perform real-world tasks, as tech companies ramp up a push into the fast-growing physical AI field. The foundational robotics models based on Alibaba's Qwen models will help robots to adapt to diverse environments, handling tasks in unfamiliar settings while following natural language instructions, the company said Tuesday. The newly introduced Qwen-Robot Suite comprises three core models: Qwen-RobotManip, a generalizable vision-language-action model; Qwen-RobotNav, a scalable vision-language navigation model; and Qwen-RobotWorld, a video world model designed for embodied intelligence. These models have entered real-world pilot testing with select Alibaba Cloud enterprise customers within the robotics sector, Alibaba said. Even as Chinese AI startups including Moonshot AI and MiniMax are pushing aggressively on the large language model front, tech incumbents such as Alibaba and Baidu are seeking to build an ecosystem around AI from models to chips. Alibaba expects AI-related product revenue to become the primary driver of revenue growth for the cloud segment, Chief Executive Eddie Wu said earlier this year.
Alibaba launched the Qwen-Robot Suite, a comprehensive set of AI models designed to help robots understand and operate in the physical world. The suite includes models for navigation, manipulation, and physics prediction, positioning Alibaba as a vertically integrated player spanning chips, cloud, and applications in China's emerging robot economy.
Alibaba has launched its first suite of AI models for robots, marking a strategic shift from conversational chatbots to physical AI systems capable of performing complex real-world tasks [1]. The Qwen-Robot Suite comprises three foundation models that form what the company describes as a "full stack for embodied intelligence" [2]. Alibaba describes itself as the only company in China operating all five layers of the AI stack: chips, cloud infrastructure, models, serving platforms, and the applications built on top [1].
The newly introduced models address the gap between vision and language understanding and physical control, which Alibaba calls the central bottleneck for embodied intelligence [3]. At the center of this effort is RynnBrain, a system built to help machines understand space, objects, and motion, the perceptual groundwork a robot needs before it can act in the physical world [1]. Alongside the robotics suite, Alibaba announced Qwen3.7-Max, the latest in its proprietary large-language-model line, which the company says can run autonomously for up to 35 hours without performance degrading [1].
The Qwen-Robot Suite consists of three specialized models, each addressing distinct challenges in robotics. Qwen-RobotNav is a scalable vision-language navigation model that unifies five navigation tasks: instruction following, point-goal navigation, object search, target tracking, and autonomous driving [2]. Trained on 15.6 million samples, it achieves 76.5% success on VLN-CE RxR, a benchmark for vision-and-language navigation in real-world environments, and 90% tracking on EVT-Bench, which evaluates an agent's ability to consistently follow moving targets [2].
Qwen-RobotManip is a generalizable vision-language-action model that tackles one of robotics' biggest challenges: different robots represent actions in fundamentally different ways [2][4]. To bridge these incompatible action spaces, Alibaba synthesized approximately 38,100 hours of training data from open-source robot datasets and human videos, without relying on proprietary data collection [2]. The model ranks first on RoboChallenge Table30-v1, outperforming previous approaches by 20% [2].
Qwen-RobotWorld is the most ambitious component: a video-based world model designed for embodied intelligence that treats natural language as a universal action interface [2][3]. The Embodied World Knowledge corpus spans 8.6 million video-text pairs (200 million frames) across manipulation, autonomous driving, indoor navigation, and human-to-robot transfer across 14 robot arms [2]. It ranks first on EWMBench and DreamGen Bench and scores perfectly on physics adherence, including Newton's laws, mass conservation, fluid dynamics, and gravity [2].
The launch comes as Chinese firms, like their American counterparts, have concluded that the more lucrative business lies not in conversational models but in systems that can take actions (book, buy, operate, schedule) on a user's behalf [1]. Robotics is the most physical expression of that bet, extending AI agents from the screen into warehouses and homes [1]. The company's physical AI push aligns with a national strategy that treats both AI and robotics as priorities, pairing a domestic model stack with China's manufacturing base in a vertical play that software-only rivals find harder to match [1].
These AI models for embodied intelligence have entered real-world pilot testing with select Alibaba Cloud enterprise customers within the robotics sector [4]. The launch came one week after a report that Alibaba Group had formed a new business unit known as Token Foundry, led by CEO Eddie Wu, combining the company's Tongyi Lab and Future Life business units to strengthen its AI efforts [3]. Wu has stated that Alibaba expects AI-related product revenue to become the primary driver of revenue growth for the cloud segment [4].
While Western labs including Google DeepMind, Nvidia, Figure, and Physical Intelligence pursue similar goals, most focus on navigation or manipulation separately, not a unified, composable suite [2]. The distinction between language models and world foundation models is critical: while a language model predicts tokens, these models must understand physics, spatial relationships, and consequences of physical actions [2]. A language model tells you a glass breaks if dropped; Qwen-RobotWorld predicts how it breaks, including shatter pattern, fluid dynamics, and secondary collisions [2].
The gap between controlled demonstrations and reliable real-world deployment remains enormous. The benchmarks these models excel on, RoboCasa365, LIBERO-Plus, and RoboTwin-Clean2Rand, are simulation environments [2]. Real-world deployment introduces sensor noise, actuator drift, and the long tail of edge cases that have humbled every robotics effort in history [2]. Alibaba has not detailed pricing, availability, or which customers will receive the robot models first [1]. What the company has established is a position: a claim to span the whole stack at the moment the industry decides AI agents, not chatbots, are the prize [1].
Summarized by Navi