3 Sources
[1]
Ai2 unveils MolmoAct: Open-source robotics system reasons in 3D and adjusts on the fly
The Allen Institute for AI released a new AI robotics system that uses novel approaches to help robots navigate messy real-world environments, while making all of the model's code, data, and training methods publicly available under open-source principles. The system, called MolmoAct, converts 2D images into 3D visualizations, previews its movements before acting, and lets human operators adjust those actions in real time. It differs from existing robotics models that often work as opaque black boxes, trained on proprietary datasets.

Ai2 expects the system to be used by robotics researchers, companies, and developers as a foundation for building robots that can operate in unstructured environments such as homes, warehouses, and disaster response scenes. In demos last week at Ai2's new headquarters north of Seattle's Lake Union, researchers showed MolmoAct interpreting natural language commands to direct a robotic arm to pick up household objects, such as cups and plush toys, and move them to specific locations.

Researchers described it as part of Ai2's broader efforts to create a comprehensive set of open-source AI tools and technologies. The Seattle-based research institute was founded in 2014 by the late Microsoft co-founder Paul Allen, and is funded in part by his estate. Ai2's flagship OLMo large language model is a fully transparent alternative to proprietary systems, with openly available training data, code, and model weights, designed to support research and public accountability in AI development.

The institute's projects are moving in "one big direction" -- toward a unified AI model "that can do reasoning and language, that can understand images, videos, that can control a robot, and that can make sense of space and actions," said Ranjay Krishna, Ai2's research lead for computer vision, and a University of Washington Allen School assistant professor. MolmoAct builds on Ai2's Molmo multimodal AI model -- which can understand and describe images -- by adding the ability to reason in 3D and direct robot actions.
[2]
AI2's MolmoAct model 'thinks in 3D' to challenge Nvidia and Google in robotics AI
Physical AI, where robotics and foundation models come together, is a fast-growing space, with companies like Nvidia, Google and Meta releasing research and experimenting with melding large language models (LLMs) with robots. New research from the Allen Institute for AI (Ai2) aims to challenge Nvidia and Google in physical AI with the release of MolmoAct 7B, a new open-source model that allows robots to "reason in space." MolmoAct, based on Ai2's open source Molmo, "thinks" in three dimensions. Ai2 is also releasing the model's training data. Ai2 has an Apache 2.0 license for the model, while the datasets are licensed under CC BY-4.0.

Ai2 classifies MolmoAct as an Action Reasoning Model, in which foundation models reason about actions within a physical, 3D space. What this means is that MolmoAct can use its reasoning capabilities to understand the physical world, plan how it occupies space and then take that action. "MolmoAct has reasoning in 3D space capabilities versus traditional vision-language-action (VLA) models," Ai2 told VentureBeat in an email. "Most robotics models are VLAs that don't think or reason in space, but MolmoAct has this capability, making it more performant and generalizable from an architectural standpoint."

Physical understanding

Since robots exist in the physical world, Ai2 claims MolmoAct helps robots take in their surroundings and make better decisions on how to interact with them. "MolmoAct could be applied anywhere a machine would need to reason about its physical surroundings," the company said. "We think about it mainly in a home setting because that's where the greatest challenge lies for robotics, because there things are irregular and constantly changing, but MolmoAct can be applied anywhere."

MolmoAct understands the physical world by outputting "spatially grounded perception tokens," which are pretrained tokens extracted using a vector-quantized variational autoencoder, a model that converts data inputs, such as video, into tokens. The company said these tokens differ from those used by VLAs in that they are not text inputs. They enable MolmoAct to gain spatial understanding and encode geometric structure, which the model uses to estimate the distance between objects. Once it has estimated distances, MolmoAct predicts a sequence of "image-space" waypoints -- points in the scene through which it can set a path. After that, the model begins outputting specific actions, such as dropping an arm by a few inches or stretching out. Ai2's researchers said they were able to get the model to adapt to different embodiments (i.e., either a mechanical arm or a humanoid robot) "with only minimal fine-tuning." Benchmark testing conducted by Ai2 showed MolmoAct 7B had a task success rate of 72.1%, beating models from Google, Microsoft and Nvidia.

A small step forward

Ai2's research is the latest to take advantage of the unique benefits of LLMs and VLMs, especially as the pace of innovation in generative AI continues to accelerate. Experts in the field see work from Ai2 and other tech companies as building blocks. Alan Fern, professor at the Oregon State University College of Engineering, told VentureBeat that Ai2's research "represents a natural progression in enhancing VLMs for robotics and physical reasoning."
"While I wouldn't call it revolutionary, it's an important step forward in the development of more capable 3D physical reasoning models," Fern said. "Their focus on truly 3D scene understanding, as opposed to relying on 2D models, marks a notable shift in the right direction. They've made improvements over prior models, but these benchmarks still fall short of capturing real-world complexity and remain relatively controlled and toyish in nature." He added that while there's still room for improvement on the benchmarks, he is "eager to test this new model on some of our physical reasoning tasks." Daniel Maturana, co-founder of the start-up Gather AI, praised the openness of the data, noting that "this is great news because developing and training these models is expensive, so this is a strong foundation to build on and fine-tune for other academic labs and even for dedicated hobbyists." Increasing interest in physical AI It has been a long-held dream for many developers and computer scientists to create more intelligent, or at least more spatially aware, robots. However, building robots that process what they can "see" quickly and move and react smoothly gets difficult. Before the advent of LLMs, scientists had to code every single movement. This naturally meant a lot of work and less flexibility in the types of robotic actions that can occur. Now, LLM-based methods allow robots (or at least robotic arms) to determine the following possible actions to take based on objects it is interacting with. Google Research's SayCan helps a robot reason about tasks using an LLM, enabling the robot to determine the sequence of movements required to achieve a goal. Meta and New York University's OK-Robot uses visual language models for movement planning and object manipulation. Hugging Face released a $299 desktop robot in an effort to democratize robotics development. Nvidia, which proclaimed physical AI to be the next big trend, released several models to fast-track robotic training, including Cosmos-Transfer1. OSU's Fern said there's more interest in physical AI even though demos remain limited. However, the quest to achieve general physical intelligence, which eliminates the need to individually program actions for robots, is becoming easier. "The landscape is more challenging now, with less low-hanging fruit. On the other hand, large physical intelligence models are still in their early stages and are much more ripe for rapid advancements, which makes this space particularly exciting," he said.
[3]
Ai2 releases an open AI model that allows robots to 'plan' movements in 3D space - SiliconANGLE
Seattle-based artificial intelligence research institute Ai2, the Allen Institute for AI, today announced the release of MolmoAct 7B, an open embodied AI model that brings intelligence to robotics by allowing robots to "think" through actions before performing them.

Spatial reasoning isn't new for AI models, which are capable of reasoning about the world by visualizing images or video and then drawing conclusions about them. For example, a user can upload an image or video to OpenAI's ChatGPT and ask questions about how to assemble a desk and receive an answer. Similarly, robotics AI foundation models can be told to pick up a cup and place it in the sink.

"Embodied AI needs a new foundation that prioritizes reasoning, transparency and openness," said Chief Executive Ali Farhadi. "With MolmoAct, we're not just releasing a model; we're laying the groundwork for a new era of AI, bringing the intelligence of powerful AI models into the physical world."

Most robotics AI models operate by reasoning about the language provided to them, breaking down natural language sentences -- such as the example above, "Pick up the cup on the counter and put it in the sink" -- and turning them into actions. They do this by combining a command with knowledge gained from cameras and other sensors. Ai2 said MolmoAct is the first in a new category of AI models the company is calling an action reasoning model, or ARM, which interprets high-level natural language and then reasons through a plan of physical actions to carry it out in the real world. Unlike current robotics models on the market that operate as vision-language-action foundation models, ARMs break down instructions into a series of waypoints and actions that take into account what the model can see.

"As soon as it sees the world, it lifts the entire world into 3D and then it draws a trajectory to define how its arms are going to move in that space," Ranjay Krishna, the computer vision team lead at Ai2, told SiliconANGLE in an interview. "So, it plans for the future. And after it's done planning, only then does it start taking actions and moving its joints."

Both ARM and VLA models act as "brains" for robots; examples include pi-zero from AI robotics startup Physical Intelligence, Nvidia Corp.'s GR00T N1 for humanoid robots, OpenVLA, a 7 billion-parameter open-source model commonly used by academic researchers for experiments, and Octo, a 93 million-parameter model. Parameters refer to the number of internal variables a model uses to make decisions and predictions. MolmoAct contains 7 billion parameters, hence the 7B in its name.

The company used 18 million samples on a cluster of 256 Nvidia H100 graphics processing units to train the model, finishing pre-training in about a day. The fine-tuning took 64 H100s about two hours. By comparison, Nvidia's GR00T-N2-2B was trained on 600 million samples with 1,024 H100s, while Physical Intelligence trained pi-zero using 900 million samples and an undisclosed number of chips.

"A lot of these companies give you these tech reports, but these tech reports kind of look like this: They have this big black box in the middle that says, 'transformer,' right? And beyond that, you really don't know what's going on," said Krishna.

Unlike many current models on the market, MolmoAct 7B was trained on a curated open dataset of around 12,000 "robot episodes" from real-world environments, such as kitchens and bedrooms.
These demonstrations were used to map goal-oriented actions -- such as arranging pillows and putting away laundry. Krishna explained that MolmoAct overcomes the industry's transparency problem by being fully open, providing its code, weights and evaluations and thus resolving the "black box" problem: it is trained on open data, and its inner workings are transparent and openly available.

To add even more control, users can preview the model's planned movements before execution, with its intended motion trajectories overlaid on camera images. These plans can be modified using natural language or by sketching corrections on a touchscreen. This provides a fine-grained method for developers or robotics technicians to control robots in different settings such as homes, hospitals and warehouses.

Ai2 said it evaluated MolmoAct's pre-training capabilities using SimPLER, a benchmark that uses a set of simulated test environments for common real-world robot setups. Using the benchmark, the model achieved a state-of-the-art task success rate of 72.1%, beating models from Physical Intelligence, Google LLC, Microsoft Corp. and Nvidia.

"MolmoAct is our first sort of foray into this space showing that reasoning models are the right way of going for training these large-scale foundation models for robotics," said Krishna. "Our mission is to enable real world applications, so anybody out there can download our model and then fine tune it for any sort of purposes that they have, or try using it out of the box."
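The preview-and-correct workflow described above lends itself to a simple illustration: overlay the planned waypoints on the camera frame for the operator, and execute only the trajectory they approve or redraw. The sketch below uses OpenCV for the overlay; the function names and data shapes are assumptions made for illustration, not MolmoAct's interface.

```python
# Hypothetical preview-and-correct loop (not Ai2's API): draw the planned
# trajectory on the camera frame, show it to the operator, and execute only
# the plan they approve or replace with a sketched correction.
import numpy as np
import cv2

def overlay_trajectory(frame: np.ndarray, waypoints: list) -> np.ndarray:
    preview = frame.copy()
    pts = np.array(waypoints, dtype=np.int32).reshape(-1, 1, 2)
    cv2.polylines(preview, [pts], False, (0, 255, 0), 2)      # planned path
    for x, y in waypoints:
        cv2.circle(preview, (x, y), 4, (0, 0, 255), -1)       # waypoint markers
    return preview

def confirm_or_correct(planned: list, user_sketch=None) -> list:
    # A touchscreen sketch (or a follow-up language command) overrides the plan;
    # otherwise the original trajectory is executed as previewed.
    return user_sketch if user_sketch is not None else planned

frame = np.zeros((480, 640, 3), dtype=np.uint8)                # stand-in camera image
planned = [(100, 400), (200, 320), (320, 260), (420, 240)]
preview = overlay_trajectory(frame, planned)                   # what the operator reviews
final_plan = confirm_or_correct(planned, user_sketch=None)     # only then does motion begin
```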
Ai2 releases MolmoAct, an open-source AI model that enables robots to reason and plan movements in 3D space, challenging industry giants in the field of physical AI and robotics.
The Allen Institute for AI (Ai2) has unveiled MolmoAct, an open-source AI model that it describes as the first of a new category of "action reasoning models" for robotics. The system enables robots to reason about and plan their movements in three-dimensional space, a notable advance in physical AI [1].
MolmoAct stands out from traditional robotics models through its ability to "think" in 3D. The system converts 2D images into 3D visualizations, allowing robots to preview their movements before acting. This spatial reasoning capability enables robots to better understand and interact with their physical surroundings [2].
Key features of MolmoAct include:
- Lifting 2D camera images into 3D scene representations before planning
- Previewing planned motion trajectories, overlaid on camera images, before any movement is executed
- Accepting real-time corrections from operators via natural language or touchscreen sketches
- Fully open code, weights, training data, and evaluations
- Adaptation to different robot embodiments with minimal fine-tuning
Ai2's decision to make MolmoAct fully open-source sets it apart in an industry often characterized by proprietary systems. The model's code, data, and training methods are publicly available, promoting transparency and facilitating further research and development [1].
This open approach challenges industry giants like Nvidia and Google, which have also been exploring the intersection of robotics and foundation models. Ai2's Chief Executive, Ali Farhadi, emphasized that MolmoAct is "laying the groundwork for a new era of AI, bringing the intelligence of powerful AI models into the physical world" [3].
MolmoAct 7B, named for its 7 billion parameters, was trained on a curated dataset of around 12,000 "robot episodes" from real-world environments. The training process utilized 256 Nvidia H100 GPUs and took approximately one day to complete [3].
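As a rough back-of-envelope on those figures (together with the comparison numbers quoted earlier in this article), the arithmetic below illustrates relative training scale only; it is not an official accounting of compute cost.

```python
# Rough scale estimates derived from the figures quoted in this article.
pretrain_gpu_hours = 256 * 24   # ~1 day of pre-training on 256 H100s ≈ 6,144 GPU-hours
finetune_gpu_hours = 64 * 2     # ~2 hours of fine-tuning on 64 H100s ≈ 128 GPU-hours

# Reported sample counts: 18M (MolmoAct) vs. 600M (Nvidia's GR00T model) and
# 900M (Physical Intelligence's pi-zero).
gr00t_ratio = 600_000_000 / 18_000_000    # ≈ 33x more training samples
pi_zero_ratio = 900_000_000 / 18_000_000  # ≈ 50x more training samples
print(pretrain_gpu_hours, finetune_gpu_hours, round(gr00t_ratio), round(pi_zero_ratio))
```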
In benchmark testing using SimPLER, MolmoAct achieved a task success rate of 72.1%, outperforming models from competitors such as Physical Intelligence, Google, Microsoft, and Nvidia [2].
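For readers unfamiliar with how such a number is produced, the sketch below shows the usual pattern for a simulated benchmark of this kind: roll the policy out over many scripted episodes and report the fraction that succeed. The environment and policy objects here are hypothetical placeholders, not the SimPLER harness itself.

```python
# Generic sketch of how a simulated task-success rate is computed. The
# `policy` and `episode` objects are hypothetical placeholders, not the
# actual SimPLER benchmark harness.
def evaluate(policy, make_episode, n_episodes: int = 1000) -> float:
    successes = 0
    for _ in range(n_episodes):
        episode = make_episode()              # e.g. "put the cup in the sink"
        obs = episode.reset()
        done, success = False, False
        while not done:
            action = policy(obs, episode.instruction)
            obs, done, success = episode.step(action)
        successes += int(success)
    return successes / n_episodes             # 0.721 would mean 72.1% success
```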
Ai2 envisions MolmoAct being used in various settings, including homes, warehouses, and disaster response scenes. The model's ability to adapt to different robot embodiments with minimal fine-tuning makes it versatile for a wide range of applications [1].
Ranjay Krishna, Ai2's computer vision team lead, highlighted the model's potential: "Our mission is to enable real-world applications, so anybody out there can download our model and then fine-tune it for any sort of purposes that they have, or try using it out of the box" [3].
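Because the weights and data are openly released, the expected workflow is to download the checkpoint and either run it out of the box or fine-tune it for a specific embodiment. The snippet below is a hypothetical usage sketch built on Hugging Face Transformers: the repository ID, processor call, and prompt format are assumptions to be checked against Ai2's model card, not confirmed details of the release.

```python
# Hypothetical usage sketch, not confirmed against Ai2's release notes:
# the repository ID, processor call, and prompt format below are assumptions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "allenai/MolmoAct-7B"  # assumed identifier; check Ai2's model card
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("counter.jpg")  # a frame from the robot's camera
prompt = "Pick up the cup on the counter and put it in the sink."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```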
As the field of physical AI continues to evolve, MolmoAct represents a significant step towards more intelligent and adaptable robotic systems, potentially transforming industries and accelerating innovation in AI-powered robotics.