Allen Institute for AI releases Molmo 2, challenging Google and OpenAI with open video analysis

Reviewed by Nidhi Govil

The Allen Institute for AI unveiled Molmo 2, a family of open-source AI vision models that can watch, track, and analyze videos with precision. The model surpasses Google's Gemini on video tracking tasks while using just 9 million training videos compared to Meta's 72.5 million. Unlike closed systems from tech giants, Molmo 2 is fully open, releasing model weights, training code, and datasets publicly.

Allen Institute for AI Challenges Tech Giants with Molmo 2

The Allen Institute for AI has released Molmo 2, a new family of open-source AI vision models designed to analyze, track, and answer questions about video content with remarkable precision [1]. Building on the success of the original Molmo, released in September 2024, this latest iteration brings advanced video understanding capabilities that rival closed systems from Google, OpenAI, and Meta. According to benchmark tests, Molmo 2 beats open-source models on short video analysis and surpasses Google's Gemini 3 on video tracking tasks [1].

Source: SiliconANGLE

The Seattle-based nonprofit founded by late Microsoft co-founder Paul Allen has built a reputation for fully open-source AI development, contrasting sharply with the closed or partially open approaches of industry giants. Ali Farhadi, CEO of the Allen Institute for AI, emphasized the organization's commitment during a media briefing, stating that they're "basically building models that are competitive with the best things out there" while maintaining complete openness [1].

Multimodal AI Models with Unprecedented Precision

The Molmo 2 family includes three distinct variants tailored for different use cases: Molmo 2 8B, Molmo 2 4B, and Molmo 2-O 7B [2]. The 8B and 4B models are based on Qwen 3, Alibaba's open-weights reasoning models, while the Molmo 2-O variant builds on OLMo, Ai2's own open-source model family focused on high intelligence and reasoning performance [2].
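
For readers who want to experiment, the sketch below follows the loading pattern published for the original Molmo checkpoints on Hugging Face; the processor.process and generate_from_batch calls come from that model's remote code, and the repository ID is a placeholder carried over from the original release. The actual Molmo 2 model names, and any changes to their loading code, should be taken from Ai2's Hugging Face pages.

# Sketch: loading a Molmo-family checkpoint with Hugging Face transformers.
# The repo ID below points at the original Molmo; swap in a Molmo 2 ID once
# it is confirmed on Ai2's Hugging Face organization page.
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image

MODEL_ID = "allenai/Molmo-7B-D-0924"  # placeholder: original Molmo release

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True,
                                          torch_dtype="auto", device_map="auto")
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True,
                                             torch_dtype="auto", device_map="auto")

# Prepare a single image (for example, one video frame) plus a text prompt.
inputs = processor.process(images=[Image.open("frame.jpg")],
                           text="Describe what is happening in this frame.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}  # batch of 1

# Generate an answer and decode only the newly produced tokens.
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
new_tokens = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(new_tokens, skip_special_tokens=True))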

What sets these models apart is their efficiency. The 8B model exceeds the original 72-billion-parameter Molmo model on key image understanding tasks, setting a new standard for performance relative to size [2]. The compact 4B variant excels at reasoning despite its small footprint, outperforming open models like Qwen 3-VL-8B while using significantly less training data.

Object Tracking in Complex Scenes and Real-World Applications

During demonstrations at Ai2's Seattle offices, researchers showcased Molmo 2's ability to handle video and multi-image understanding tasks with impressive accuracy [1]. In a soccer clip, the model identified defensive mistakes leading to a goal. When analyzing baseball footage, it recognized the Angels and Mariners, identified player #55 who scored, and explained how it determined the home team by reading uniforms and stadium branding [1].

The model's tracking capabilities proved particularly robust. In one demonstration, it followed four penguins moving around the frame, maintaining consistent IDs even when they overlapped. When asked to count a dancer's flips, it did not just provide a number; it returned timestamps and pixel coordinates for each flip [1]. In a racing scenario, the model understood the query "track the car that passes the #13 car in the end," watched the entire clip, then identified and tracked the correct vehicle even as cars moved in and out of frame.
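
The grounded answers described above, with per-event timestamps, pixel coordinates, and stable object IDs, map naturally onto small structured records. The sketch below is purely illustrative: it is not Ai2's output schema, just one way a downstream application might represent the points returned for a counting or tracking query.

from dataclasses import dataclass

@dataclass
class GroundedPoint:
    # Illustrative structure, not Molmo 2's actual output format.
    track_id: int       # stable ID for one tracked object (e.g. a single penguin)
    timestamp_s: float  # when the event happens in the clip, in seconds
    x: float            # pixel x coordinate of the point
    y: float            # pixel y coordinate of the point
    label: str          # what the point marks, e.g. "flip" or "penguin"

# Example: three counted flips, each grounded in time and space.
flips = [
    GroundedPoint(track_id=1, timestamp_s=2.4, x=512.0, y=288.0, label="flip"),
    GroundedPoint(track_id=1, timestamp_s=5.1, x=498.0, y=301.0, label="flip"),
    GroundedPoint(track_id=1, timestamp_s=7.8, x=505.0, y=295.0, label="flip"),
]
print(f"counted {len(flips)} flips")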

Physical AI Applications and Industry Impact

Models like Molmo 2 form the foundation for Physical AI applications: systems that perceive, understand, and reason about the real world in order to interact meaningfully with it [2]. This capability is critical for robotics, autonomous vehicles, traffic cameras, retail item-tracking platforms, and safety monitoring systems. For machines to interact safely with their environment, they must first understand what they are observing: segmenting objects, tracking them over time, and assigning them expected properties [2].

The institute has seen more than 21 million downloads of its models this year and nearly 3 billion queries across its systems [1]. This year also brought $152 million in funding from the NSF and Nvidia, partnerships on AI cancer research with Seattle's Fred Hutch, and the release of OLMo 3, a text model rivaling those from Meta and DeepSeek [1].

Open Datasets and Training Efficiency

Ai2 is releasing nine new open datasets totaling more than 9 million multimodal examples across dense video captions, video grounding, tracking, and multi-image reasoning [2]. The captioning dataset alone spans over 100,000 videos with detailed descriptions averaging more than 900 words each. This approach emphasizes quality over quantity: Molmo 2 was trained on approximately 9 million videos, compared with the 72.5 million required by Meta's PerceptionLM [1].

Unlike "open weight" models that release only the final product, Molmo 2 provides model weights, training code, and training data publicly

1

. This enables developers to trace a model's behavior back to its training data, customize it for specific uses, and avoid vendor lock-in. All models, datasets, and evaluation tools are now available on GitHub, Hugging Face, and Ai2 Playground for interactive testing

2

. The institute plans to release training code soon, further cementing its commitment to open-source AI development and advancing computer vision capabilities for the broader research community.
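
As a concrete starting point, the released artifacts can be fetched with the standard Hugging Face Hub client. The repository IDs in the sketch below are placeholders, since the exact Molmo 2 model and dataset names (presumably under Ai2's allenai organization) should be looked up on Hugging Face first.

# Sketch: downloading released weights or datasets from the Hugging Face Hub.
# Both repo IDs are placeholders; check huggingface.co/allenai for the real
# Molmo 2 repository names before running.
from huggingface_hub import snapshot_download

# Model repositories (weights, config, processor files) download to a local cache.
model_dir = snapshot_download(repo_id="allenai/Molmo-7B-D-0924")  # placeholder: original Molmo

# Dataset repositories use the same call with repo_type="dataset", for example:
# data_dir = snapshot_download(repo_id="allenai/<molmo2-dataset>", repo_type="dataset")

print("model files downloaded to:", model_dir)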
