2 Sources
[1]
Allen Institute for AI rivals Google, Meta and OpenAI with open-source AI vision model
How many penguins are in this wildlife video? Can you track the orange ball in the cat video? Which teams are playing, and who scored? Can you give step-by-step instructions from this cooking video?

Those are examples of queries that can be fielded by Molmo 2, a new family of open-source AI vision models from the Allen Institute for AI (Ai2) that can watch, track, analyze and answer questions about videos: describing what's happening, and pinpointing exactly where and when.

Ai2 cites benchmark tests showing Molmo 2 beating open-source models on short video analysis and tracking, and surpassing closed systems like Google's Gemini 3 on video tracking, while approaching their performance on other image and video tasks.

In a series of demos for reporters recently at the Ai2 offices in Seattle, researchers showed how Molmo 2 could analyze a variety of short video clips in different ways.

* In a soccer clip, researchers asked what defensive mistake led to a goal. The model analyzed the sequence and pointed to a failure to clear the ball effectively.
* In a baseball clip, the AI identified the teams (Angels and Mariners), the player who scored (#55), and explained how it knew the home team by reading uniforms and stadium branding.
* Given a cooking video, the model returned a structured recipe with ingredients and step-by-step instructions, including timing pulled from on-screen text.
* Asked to count how many flips a dancer performed, the model didn't just say "five" -- it returned timestamps and pixel coordinates for each one.
* In a tracking demo, the model followed four penguins as they moved around the frame, maintaining a consistent ID for each bird even when they overlapped.
* When asked to "track the car that passes the #13 car in the end," the model watched an entire racing clip first, understood the query, then went back and identified the correct vehicle. It tracked cars that went in and out of frame.

Big year for Ai2

Molmo 2, announced Tuesday morning, caps a year of major milestones for the Seattle-based nonprofit, which has developed a loyal following in business and scientific circles by building fully open AI systems. Its approach contrasts sharply with the closed or partially open approaches of industry giants like OpenAI, Google, Microsoft, and Meta.

Founded in 2014 by the late Microsoft co-founder Paul Allen, Ai2 this year landed $152 million from the NSF and Nvidia, partnered on an AI cancer research initiative led by Seattle's Fred Hutch, and released Olmo 3, a text model rivaling those from Meta, DeepSeek and others.

Ai2 has seen more than 21 million downloads of its models this year and nearly 3 billion queries across its systems, said Ali Farhadi, the Ai2 CEO, during the media briefing last week at the institute's new headquarters on the northern shore of Seattle's Lake Union.

As a nonprofit, Ai2 isn't trying to compete commercially with the tech giants -- it's aiming to advance the state of the art and make those advances freely available. The institute has released open models for text (OLMo), images (the original Molmo), and now video, building toward what Farhadi described as a unified model that reasons across all modalities.

"We're basically building models that are competitive with the best things out there," Farhadi said -- but in a completely open manner, for a succession of different media and situations.
In addition to Molmo 2, Ai2 on Monday released Bolmo, an experimental text model that processes language at the character level rather than in word fragments -- a technical shift that improves handling of spelling, rare words, and multilingual text.

Expanding into video analysis

With the newly released Molmo 2, the focus is video. To be clear: the model analyzes video, it doesn't generate video -- think understanding footage rather than creating it.

The original Molmo, released last September, could analyze static images with precision rivaling closed-source competitors. It introduced a "pointing" capability that let it identify specific objects within a frame. Molmo 2 brings that same approach to video and multi-image understanding.

The concept isn't new. Google's Gemini, OpenAI's GPT-4o, and Meta's Perception LM can all process video. But in line with Ai2's broader mission as a nonprofit institute, Molmo 2 is fully open, with its model weights, training code, and training data all publicly released. That's different from "open weight" models that release the final product but not the original recipe, and a stark contrast to closed systems from Google, OpenAI and others.

The distinction is not just academic. Ai2's approach means developers can trace a model's behavior back to its training data, customize it for specific uses, and avoid being locked into a vendor's ecosystem.

Ai2 also emphasizes efficiency. For example, Meta's Perception LM was trained on 72.5 million videos. Molmo 2 used about 9 million, relying on high-quality human annotations. The result, Ai2 claims, is a smaller, more efficient model that outperforms the institute's own much larger model from last year, and comes close to matching commercial systems from Google and OpenAI, while being simple enough to run on a single machine.

When the original Molmo introduced its pointing capability last year -- allowing the model to identify specific objects in an image -- competing models quickly adopted the feature. "We know they adopted our data because they perform exactly as well as we do," said Ranjay Krishna, who leads Ai2's computer vision team. Krishna is also a University of Washington assistant professor, and several of his graduate students work on the project.

Farhadi frames the competitive dynamic differently than most in the industry. "If you do real open source, I would actually change the word competition to collaboration," he said. "Because there is no need to compete. Everything is out there. You don't need to reverse engineer. You don't need to rebuild it. Just grab it, build on top of it, do the next thing. And we love it when people do that."

A work in progress

At the same time, Molmo 2 has some clear constraints. The tracking capability -- following objects across frames -- currently tops out at about 10 items. Ask it to track a crowd or a busy highway, and the model can't keep up.

"This is a very, very new capability, and it's one that's so experimental that we're starting out very small," Krishna said. "There's no technological limit to this, it just requires more data, more examples of really crowded scenes."

Long-form video also remains a challenge. The model performs well on short clips, but analyzing longer footage requires compute that Ai2 isn't yet willing to spend. In the playground launching alongside Molmo 2, uploaded videos are limited to 15 seconds. And unlike some commercial systems, Molmo 2 doesn't process live video streams. It analyzes recordings after the fact.
Krishna said the team is exploring streaming capabilities for applications like robotics, where a model would need to respond to observations in real time, but that work is still early. "There are methods that people have come up with in terms of processing videos over time, streaming videos," Krishna said. "Those are directions we're looking into next."

Molmo 2 is available starting today on Hugging Face and Ai2's playground.
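For developers who want to try it, the sketch below shows roughly what loading and querying the model could look like. It is a minimal sketch, not official usage: the repository id is a placeholder, the frame-sampling step is an assumption about how a short clip might be fed in, and the call pattern simply mirrors the original Molmo's published Hugging Face example, so Molmo 2's actual interface may differ -- the model card is the authoritative reference.

# Minimal sketch, NOT the official Molmo 2 API. The repository id is a
# placeholder, and the call pattern mirrors the original Molmo's published
# Hugging Face example (trust_remote_code, processor.process,
# generate_from_batch); Molmo 2's real interface may differ.
import cv2
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

MODEL_ID = "allenai/Molmo-2-8B"  # hypothetical repository id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

def sample_frames(path, num_frames=8):
    """Decode a short clip into a few evenly spaced RGB frames."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // num_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames[:num_frames]

# Pass the sampled frames plus a question, following the original Molmo pattern.
inputs = processor.process(
    images=sample_frames("penguins.mp4"),
    text="How many penguins appear in this video?",
)
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
answer = processor.tokenizer.decode(
    output[0, inputs["input_ids"].size(1):], skip_special_tokens=True
)
print(answer)

Because the playground caps uploads at 15 seconds, sampling a handful of frames per clip is typically enough for this kind of short-video question answering while keeping the workload small enough for a single machine.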
[2]
Allen Institute for AI introduces Molmo 2, bringing open video understanding to AI systems - SiliconANGLE
Building on its foundation of image-understanding artificial intelligence models, the Allen Institute for AI today introduced Molmo 2, a multimodal model family adapted to video and multi-image understanding.

In 2024, Ai2 released Molmo, which set a new benchmark for image understanding and helped establish powerful "pointing" and tagging capabilities. Those models went beyond describing what appeared in an image; they could identify and tag objects with a high degree of confidence.

The Molmo 2 family includes three variants, each designed for different use cases: Molmo 2 8B, Molmo 2 4B and Molmo 2-O 7B. The 8B and 4B models are based on Qwen 3, Alibaba Group Holding Ltd.'s open-weights reasoning models, and provide video grounding and question-answering capabilities. The Molmo 2-O variant is built on Olmo, Ai2's open-source model family focused on high intelligence and reasoning performance.

According to Ai2, the smaller Molmo 2 models deliver outsized performance relative to their size. The 8B model exceeds the original 72 billion-parameter Molmo on key image understanding tasks and related benchmarks, setting a new standard for efficiency. The 4B variant excels at image and multi-image reasoning despite its compact size, exceeding open models such as Qwen 3-VL-8B while training on far less data: about 9.19 million videos compared with 72.5 million for Meta Platforms Inc.'s Perception LM. These smaller sizes allow the models to be deployed efficiently on less hardware, lowering costs and broadening access to these capabilities.

"With Olmo, we set the standard for truly open AI, then last year Molmo ushered the industry toward pointing; Molmo 2 pushes it even further by bringing these capabilities to videos and temporal domains," said Ali Farhadi, chief executive of Ai2.

Models such as Molmo 2 form a foundation for assistive and intelligent physical technologies, often referred to as Physical AI. These systems perceive, understand and reason about the real world to interact with it meaningfully. For machines to interact with their environment, they must first understand what they are observing. Humans perform this task intuitively, but machines require AI models that can segment objects, track them over time, tag them consistently and assign expected properties.

Ai2 said Molmo 2 introduces capabilities to video understanding that no prior open model has delivered, including identifying exactly where and when events occur, tracking multiple objects through complex scenes and connecting actions to frame-level timelines. This improved understanding of the physical world is essential for intelligent systems such as traffic cameras, retail item-tracking platforms, safety monitoring systems, autonomous vehicles and robotics. Rapid categorization of objects in a field of view, along with their inherent characteristics, enables machines to reason about what may happen next. This capability is critical not only for interaction but also for safety: understanding what a robot is observing fundamentally changes how it chooses to respond.

Additionally, Ai2 is releasing a collection of nine new open datasets used to train Molmo 2, totaling more than nine million multimodal examples across dense video captions, long-form QA grounding, tracking and multi-image reasoning.
The captioning dataset alone spans more than one hundred thousand videos with detailed descriptions that average more than nine hundred words each. According to Ai2, the corpus provides a mix of video pointing, multi-object tracking, synthetic grounding and long-video reasoning. Combined, the datasets form the foundation of the most complete open video data collection available today.

All models, datasets and evaluation tools are now publicly available on GitHub, Hugging Face and Ai2 Playground for interactive testing. Ai2 said it will release the training code soon.
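As a rough illustration of how those releases could be consumed, the sketch below streams a few records from one of the datasets with the Hugging Face datasets library. The repository id and field names are hypothetical placeholders rather than the actual schema; Ai2's dataset cards list the real identifiers.

# Minimal sketch, assuming the Molmo 2 datasets are published as standard
# Hugging Face datasets. The repository id and field names below are
# hypothetical placeholders -- consult Ai2's dataset cards for the real ones.
from datasets import load_dataset

# Hypothetical id for the dense video-captioning set described above
# (100,000+ videos with ~900-word descriptions).
captions = load_dataset("allenai/molmo2-video-captions", split="train", streaming=True)

for record in captions.take(3):
    # Field names are illustrative only.
    print(record.get("video_id"), str(record.get("caption"))[:120], "...")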
The Allen Institute for AI unveiled Molmo 2, a family of open-source AI vision models that can watch, track, and analyze videos with precision. The model surpasses Google's Gemini on video tracking tasks while using just 9 million training videos compared to Meta's 72.5 million. Unlike closed systems from tech giants, Molmo 2 is fully open, releasing model weights, training code, and datasets publicly.
The Allen Institute for AI has released Molmo 2, a new family of open-source AI vision models designed to analyze, track, and answer questions about video content with remarkable precision [1]. Building on the success of the original Molmo released in September 2024, this latest iteration brings advanced video understanding capabilities that rival closed systems from Google, OpenAI, and Meta. According to benchmark tests, Molmo 2 beats open-source models on short video analysis and surpasses Google's Gemini 3 on video tracking tasks [1].
Source: SiliconANGLE
The Seattle-based nonprofit founded by late Microsoft co-founder Paul Allen has built a reputation for fully open-source AI development, contrasting sharply with the closed or partially open approaches of industry giants. Ali Farhadi, CEO of the Allen Institute for AI, emphasized the organization's commitment during a media briefing, stating that they're "basically building models that are competitive with the best things out there" while maintaining complete openness [1].

The Molmo 2 family includes three distinct variants tailored for different use cases: Molmo 2 8B, Molmo 2 4B, and Molmo 2-O 7B [2]. The 8B and 4B models are based on Qwen 3, Alibaba's open-weights reasoning models, while the Molmo 2-O variant builds on OLMo, Ai2's own open-source model family focused on high intelligence and reasoning performance [2].

What sets these models apart is their efficiency. The 8B model exceeds the original Molmo 72 billion-parameter model on key image understanding tasks, setting a new standard for performance relative to size [2]. The compact 4B variant excels at reasoning despite its small footprint, outperforming open models like Qwen 3-VL-8B while using significantly less training data.

During demonstrations at Ai2's Seattle offices, researchers showcased Molmo 2's ability to handle video and multi-image understanding tasks with impressive accuracy [1]. In a soccer clip, the model identified defensive mistakes leading to a goal. When analyzing baseball footage, it recognized the Angels and Mariners, identified player #55 who scored, and explained how it determined the home team by reading uniforms and stadium branding [1].

The model's tracking capabilities proved particularly robust. In one demonstration, it followed four penguins moving around a frame, maintaining consistent IDs even when they overlapped. When asked to count dancer flips, it didn't just provide a number -- it returned timestamps and pixel coordinates for each flip [1]. In a racing scenario, the model understood the query "track the car that passes the #13 car in the end," watched the entire clip, then identified and tracked the correct vehicle even as cars moved in and out of frame.
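To make that tracking behavior concrete, the sketch below shows how per-frame points with stable object IDs -- the kind of output described in the penguin and dancer demos -- could be grouped into trajectories by downstream code. The record layout is invented for illustration; Molmo 2's actual output format may differ.

# Illustrative sketch only: the record layout is hypothetical, meant to show
# how per-frame points with stable track ids could be consumed downstream;
# Molmo 2's real output format may differ.
from collections import defaultdict

# Hypothetical tracking output: one record per detection, with a timestamp,
# pixel coordinates, and an id that stays constant for the same object.
detections = [
    {"t": 0.0, "id": "penguin_1", "x": 412, "y": 233},
    {"t": 0.0, "id": "penguin_2", "x": 518, "y": 240},
    {"t": 0.5, "id": "penguin_1", "x": 420, "y": 231},
    {"t": 0.5, "id": "penguin_2", "x": 509, "y": 244},
]

def build_tracks(records):
    """Group detections into per-object trajectories ordered by time."""
    tracks = defaultdict(list)
    for r in records:
        tracks[r["id"]].append((r["t"], r["x"], r["y"]))
    return {obj: sorted(points) for obj, points in tracks.items()}

for obj, path in build_tracks(detections).items():
    print(obj, "->", path)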
Models like Molmo 2 form the foundation for Physical AI applications -- systems that perceive, understand, and reason about the real world to interact meaningfully with it [2]. This capability is critical for robotics, autonomous vehicles, traffic cameras, retail item-tracking platforms, and safety monitoring systems. For machines to interact safely with their environment, they must first understand what they're observing -- segmenting objects, tracking them over time, and assigning expected properties [2].

The institute has seen more than 21 million downloads of its models this year and nearly 3 billion queries across its systems [1]. This year also brought $152 million in funding from the NSF and Nvidia, partnerships on AI cancer research with Seattle's Fred Hutch, and the release of Olmo 3, a text model rivaling those from Meta and DeepSeek [1].

Ai2 is releasing nine new open datasets totaling more than 9 million multimodal examples across dense video captions, video grounding, tracking, and multi-image reasoning [2]. The captioning dataset alone spans over 100,000 videos with detailed descriptions averaging more than 900 words each. This approach emphasizes quality over quantity -- Molmo 2 used approximately 9 million training videos compared to Meta's Perception LM, which was trained on 72.5 million [1].

Unlike "open weight" models that release only the final product, Molmo 2 provides model weights, training code, and training data publicly [1]. This enables developers to trace a model's behavior back to its training data, customize it for specific uses, and avoid vendor lock-in. All models, datasets, and evaluation tools are now available on GitHub, Hugging Face, and Ai2 Playground for interactive testing [2]. The institute plans to release training code soon, further cementing its commitment to open-source AI development and advancing computer vision capabilities for the broader research community.