Curated by THEOUTPOST
On Thu, 26 Sept, 8:03 AM UTC
3 Sources
[1]
Meet Molmo, the free model that could outshine GPT-4
The Allen Institute for AI (Ai2) has released Molmo, an innovative set of open-source multimodal models that challenge the dominance of proprietary AI systems. With strengths in image recognition and actionable insights, Molmo is positioned to assist developers, researchers, and startups by delivering an advanced yet easy-to-use tool for AI application development. The launch highlights an important shift in the AI landscape, narrowing the gap between open-source and proprietary models and broadening access to leading AI technology.

Molmo offers an exceptional degree of image understanding, allowing it to accurately interpret a wide variety of visual data -- from everyday objects to complex charts and menus. Unlike most AI models, Molmo goes beyond passive perception, enabling users to interact with virtual and real environments through pointing and a range of spatial actions. This capability marks a breakthrough, opening the door to sophisticated AI agents, robotics, and many other applications that depend on a granular understanding of both visual and contextual data.

Efficiency and accessibility are central to the Molmo development strategy. Molmo's advanced skills come from a dataset of fewer than one million images, in stark contrast to the billions of images processed by other models such as GPT-4V and Google's Gemini. This approach has made Molmo highly efficient in its use of computational resources while producing a model that is as powerful as the most capable proprietary systems, with fewer hallucinations and faster training times.

Making Molmo fully open source is part of Ai2's larger strategic effort to democratize AI development. By providing access to Molmo's language and vision training data, model weights, and source code, Ai2 enables a diverse array of users -- from startups to academic laboratories -- to innovate and advance AI technology without heavy investment or vast computing power.

Matt Deitke, a researcher at the Allen Institute for AI, said: "Molmo is an incredible AI model with exceptional visual understanding, which pushes the frontier of AI development by introducing a paradigm for AI to interact with the world through pointing. The model's performance is driven by a remarkably high quality curated dataset to teach AI to understand images through text. The training is so much faster, cheaper, and simpler than what's done today, such that the open release of how it is built will empower the entire AI community, from startups to academic labs, to work at the frontier of AI development."

According to internal evaluations, Molmo's largest model, with 72 billion parameters, surpassed OpenAI's GPT-4V and other leading competitors on several benchmarks. The smallest Molmo model, with only one billion parameters, is compact enough to run on a mobile device while outperforming models with ten times as many parameters. Here you can see the models and try them for yourself.
[2]
Ai2's new Molmo open source AI models beat GPT-4o, Claude on some benchmarks
The Allen Institute for AI (Ai2) today unveiled Molmo, an open-source family of state-of-the-art multimodal AI models that outperform top proprietary rivals including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 on several third-party benchmarks. Like the leading proprietary foundation models, Molmo can accept and analyze imagery uploaded by users. Ai2 says the release underscores its commitment to open research by offering high-performing models, complete with open weights and data, to the broader community -- and, of course, to companies looking for solutions they can completely own, control, and customize. It comes on the heels of Ai2's release two weeks ago of another open model, OLMoE, a "mixture of experts" that combines smaller models for cost effectiveness.

Closing the Gap Between Open and Proprietary AI

Molmo consists of four main models of different parameter sizes and capabilities: MolmoE-1B, Molmo-7B-O, Molmo-7B-D, and Molmo-72B. These models achieve high performance across a range of third-party benchmarks, outpacing many proprietary alternatives. And they are all available under permissive Apache 2.0 licenses, enabling virtually any use for research or commercialization, including enterprise-grade deployment. Notably, Molmo-72B leads the pack in academic evaluations, achieving the highest score on 11 key benchmarks and ranking second in user preference, closely following GPT-4o.

Vaibhav Srivastav, a machine learning developer advocate engineer at AI code repository company Hugging Face, commented on the release on X, highlighting that Molmo offers a formidable alternative to closed systems and sets a new standard for open multimodal AI. In addition, Google DeepMind robotics researcher Ted Xiao took to X to praise the inclusion of pointing data in Molmo, which he sees as a game-changer for visual grounding in robotics. This capability allows Molmo to provide visual explanations and interact more effectively with physical environments, a feature that is currently lacking in most other multimodal models. The models are not only high-performing but also entirely open, allowing researchers and developers to access and build upon cutting-edge technology.

Advanced Model Architecture and Training Approach

Molmo's architecture is designed to maximize efficiency and performance. All models use OpenAI's ViT-L/14 336px CLIP model as the vision encoder, which processes multi-scale, multi-crop images into vision tokens. These tokens are then projected into the language model's input space through a multi-layer perceptron (MLP) connector and pooled for dimensionality reduction (a simple code sketch of this layout appears at the end of this article). The language model component is a decoder-only Transformer, with options ranging from the OLMo series to the Qwen2 and Mistral series, each offering different capacities and openness levels. The training strategy for Molmo involves two key stages: multimodal pre-training on newly collected, speech-derived dense image captions, followed by supervised fine-tuning on a mixture of datasets, including the pointing data mentioned above. Unlike many contemporary models, Molmo does not rely on reinforcement learning from human feedback (RLHF), focusing instead on a meticulously tuned training pipeline that updates all model parameters based on their pre-training status.

Outperforming on Key Benchmarks

The Molmo models have shown impressive results across multiple benchmarks, particularly in comparison to proprietary models.
For instance, Molmo-72B scores 96.3 on DocVQA and 85.5 on TextVQA, outperforming both Gemini 1.5 Pro and Claude 3.5 Sonnet in these categories. It further outperforms GPT-4o on AI2D (Ai2's own benchmark, short for "A Diagram Is Worth A Dozen Images," a dataset of 5,000+ grade school science diagrams with 150,000+ rich annotations). The models also excel in visual grounding tasks, with Molmo-72B achieving top performance on RealWorldQA, making it especially promising for applications in robotics and complex multimodal reasoning.

Open Access and Future Releases

Ai2 has made these models and datasets accessible on its Hugging Face space, with full compatibility with popular AI frameworks like Transformers. This open access is part of Ai2's broader vision to foster innovation and collaboration in the AI community. Over the next few months, Ai2 plans to release additional models, training code, and an expanded version of its technical report, further enriching the resources available to researchers. For those interested in exploring Molmo's capabilities, a public demo and several model checkpoints are available now via Molmo's official page.
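Because the checkpoints are published on Ai2's Hugging Face space with Transformers compatibility, loading Molmo locally follows the familiar AutoModel pattern. The sketch below follows the example given on the Molmo model cards, but treat it as illustrative: the repository ID (allenai/Molmo-7B-D-0924) and the helpers process and generate_from_batch come from the model's custom remote code and may change, so check the model card for the current API.

```python
# Minimal sketch: load a Molmo checkpoint with Hugging Face Transformers and
# ask it to describe an image. processor.process() and generate_from_batch()
# are custom helpers shipped with the model's remote code (trust_remote_code).
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

repo = "allenai/Molmo-7B-D-0924"  # assumed repo id; verify on Hugging Face
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

image = Image.open(
    requests.get("https://picsum.photos/id/237/536/354", stream=True).raw
)

# Preprocess one image plus a text prompt into model inputs.
inputs = processor.process(images=[image], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate a completion and strip the prompt tokens before decoding.
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
new_tokens = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(new_tokens, skip_special_tokens=True))
```

To make the architecture described above more concrete, here is a minimal, hypothetical sketch of how a CLIP-style vision encoder, an MLP connector with token pooling, and a decoder-only language model can be wired together. This is not Ai2's implementation; the class names, hidden sizes, and pooling scheme are illustrative assumptions only.

```python
# Illustrative sketch (not Ai2's code): vision encoder -> MLP connector with
# pooling -> decoder-only language model, as the article describes.
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Projects vision tokens into the language model's embedding space."""
    def __init__(self, vision_dim: int, lm_dim: int, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, vision_dim)
        x = self.proj(vision_tokens)
        # Average adjacent token pairs to shorten the sequence -- a simple
        # stand-in for the dimensionality-reduction/pooling step mentioned above.
        b, n, d = x.shape
        x = x[:, : n - n % 2].reshape(b, n // 2, 2, d).mean(dim=2)
        return x

class ToyVLM(nn.Module):
    """Wires an image encoder, connector, and decoder-only LM together."""
    def __init__(self, vision_encoder: nn.Module, connector: MLPConnector,
                 language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.connector = connector
        self.language_model = language_model

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        vision_tokens = self.vision_encoder(images)       # (b, patches, v_dim)
        vision_embeds = self.connector(vision_tokens)     # (b, patches/2, lm_dim)
        # Prepend image embeddings to the text embeddings; the decoder-only LM
        # (assumed to accept inputs_embeds) attends over the combined sequence.
        combined = torch.cat([vision_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=combined)
```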
[3]
Ai2 Unveils Molmo: A New Breed of Open-Source AI That Rivals Tech Giants - Decrypt
AI enthusiasts, rejoice: there's a new multimodal large language model for you to play with. The Seattle-based non-profit AI research outfit Allen Institute for AI (Ai2) just introduced Molmo, a family of multimodal artificial intelligence models that promise to rival the capabilities of proprietary vision-based offerings from major tech companies like OpenAI and Anthropic. Multimodal refers to the ability to handle different data types, including text, images, audio, video, and even sensory information.

On Tuesday, Molmo debuted without the fanfare of the major AI models but with all the bells and whistles of any state-of-the-art vision model. The system demonstrated remarkable proficiency in interpreting visual data, from everyday objects to complex charts and messy whiteboards. In a video demonstration, Ai2 showcased Molmo's ability to create AI agents capable of executing personalized tasks, such as ordering food and organizing handwritten data into properly formatted code.

"This model pushes the boundaries of AI development by introducing a way for AI to interact with the world through pointing [out elements]," Matt Deitke, a researcher at Ai2, said in a statement. "Its performance is driven by a remarkably high-quality curated dataset that teaches AI to understand images through text."

The system was trained on a curated dataset of nearly 1 million images -- a fraction of the billions typically used by competitors. Although the dataset is small, this approach reduced computational requirements and produced fewer errors in AI responses, according to the model's research paper. Ani Kembhavi, senior director of research at Ai2, explained the rationale behind this strategy: "We've focused on using extremely high-quality data at a scale that is 1,000 times smaller," Kembhavi said. "This has produced models that are as effective as the best proprietary systems, but with fewer inaccuracies and much faster training times."

The Molmo family includes several models of varying sizes. MolmoE-1B is a mixture-of-experts model with 1 billion active parameters (7 billion total). Molmo-7B-O is the most open 7-billion-parameter model. Molmo-7B-D, meanwhile, serves as a demonstration model. At the top of the range, Molmo-72B represents the most advanced model in the family. Initial evaluations suggest that even the smaller 7-billion-parameter models perform comparably to much larger proprietary alternatives. This efficiency makes Molmo accessible to a broader range of developers and researchers, potentially accelerating innovation in the field.

Molmo's development involved novel data collection methods. The team used speech-based image descriptions from human annotators, resulting in richer and more detailed captions. They also incorporated 2D pointing data, enhancing the model's ability to perform tasks like counting and object identification.

The release of Molmo is phased. Initially, Ai2 is providing a demo, inference code, a research paper on arXiv, and select model weights. Over the next two months, the institute plans to release additional components, including a more comprehensive version of the technical report, the family of datasets used in training, additional model weights and checkpoints, and training and evaluation code. By making Molmo's code, data, and model weights publicly available, Ai2 aims to boost open AI research and innovation. This approach contrasts with the closed nature of many leading AI systems and could accelerate progress in the field.
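The 2D pointing data mentioned above shows up in practice as coordinates embedded in the model's text output. The snippet below is a hypothetical helper for turning such an answer into pixel coordinates: the <point x=".." y=".."> markup and the percentage-based coordinate convention are assumptions based on public demo outputs, not a documented API, so verify the actual output format before relying on it.

```python
# Hypothetical helper: parse point markup from a Molmo-style answer.
# The tag/attribute names and the "coordinates are percentages of image
# width/height" convention are assumptions, not a documented contract.
import re
from typing import List, Tuple

POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"[^>]*>(.*?)</point>')

def extract_points(answer: str, width: int, height: int) -> List[Tuple[str, int, int]]:
    """Return (label, x_px, y_px) tuples for every point tag in the answer."""
    points = []
    for x_pct, y_pct, label in POINT_RE.findall(answer):
        x_px = round(float(x_pct) / 100.0 * width)
        y_px = round(float(y_pct) / 100.0 * height)
        points.append((label.strip(), x_px, y_px))
    return points

# Example: counting by pointing -- each returned tuple is one located object.
answer = '<point x="21.5" y="40.0">mug</point> <point x="70.2" y="44.8">mug</point>'
print(extract_points(answer, width=1280, height=720))
# [('mug', 275, 288), ('mug', 899, 323)]
```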
Decrypt tested the model, which demonstrated fairly decent results, outperforming Llava (the standard multimodal LLM in the open-source community) and matching ChatGPT and Reka in vision tasks. The chatbot, now publicly available, is free to use. The interface is pink, but it's pretty similar to your typical AI chatbot: a side panel with previous interactions, a main screen, and a text box at the bottom. However, this model is primarily designed for vision-related tasks, at least in its initial release. Text-only input is not possible; users must upload an image to initiate an interaction. The pre-prompted image-plus-text samples on the welcome screen give a clue about how this model works. For example, it's impossible to trigger a simple text query like "Why doesn't America like Putin?", but upload a photograph of Vladimir Putin and you can ask the model that same question, since the interaction is based on a mixture of image and text.

And this was our first comparison. Upon being shown a photo of Vladimir Putin, Molmo explained that the relationship between America and Putin is tense due to several factors, including historical tensions, geopolitical competition, and human rights concerns, among others. We put Molmo to the test against today's best models. For space reasons, we used one task per model to give a broad idea of how Molmo compares at first glance.

Catching humor, nuances and subjective elements

The model excels at understanding subtle elements in photos, including humor and unusual features. Our tests revealed its proficiency in grasping these more subjective aspects. For example, when presented with an AI-generated image of Putin and Kim Jong Un sharing a beer and asked why people found it amusing, Molmo correctly identified the image as nonsensical and created for entertainment purposes. "Given the low quality of the image and its nonsensical nature, it's no wonder your friends are laughing at it in your WhatsApp group. It's not a serious or meaningful image, but rather a poorly executed joke or meme that's likely to be met with amusement or mockery," Molmo said. "Your friends may also find humor in the absurdity of the situation, as this is not something people would associate with these two individuals," was ChatGPT's explanation.

Understanding data in charts and graphs

The model also demonstrates proficiency in interpreting charts, performing on par with Reka. We presented a chart comparing ELO scores of different models within similar families and posed three questions: identifying the best overall model, counting the number of distinct model families, and evaluating the quality of a specific model with an incomplete name. These were tricky questions. Molmo accurately identified "Flux Iprol" as the top-performing model, while Reka incorrectly named "Flux [Ibrol]." However, Reka better discerned the nuances in the second task, correctly grouping similar models into families and providing the accurate answer of 7 distinct model families. Molmo, in contrast, counted each model individually. For the third task, Molmo provided a more nuanced and direct response, acknowledging SD3 as a strong model and noting its position as the best in its family while mentioning other options. Reka's reply that "SD3 is not explicitly mentioned in the image" was technically accurate but less insightful, especially considering its ability to group different SD3 versions into a single family.

Image description

The model excels at describing image elements and identifying text.
We compared its capabilities to Claude 3.5 Sonnet by asking both to describe all elements in a frame capture of Mr. William Saunders' testimony to the US Senate. Both models performed well enough, though Claude made more descriptive errors. For instance, it reversed the descriptions of elements on the right and left and mistook a woman for a younger man. Overall, Molmo shows promise as a valuable tool for users requiring a proficient vision model. It competes well with Reka, even outperforming it in certain areas. While Claude offers more versatility and power, it imposes daily interaction limits, which Molmo does not, making Molmo a better option for power users. ChatGPT avoids such restrictions but requires a paid ChatGPT Plus subscription to access its vision capabilities.
AI2 introduces Molmo, a free and open-source family of multimodal AI models that outperforms GPT-4o and Claude on certain benchmarks. This development could potentially reshape the AI landscape and democratize access to advanced multimodal models.
In a groundbreaking development, the Allen Institute for AI (AI2) has unveiled Molmo, a series of open-source multimodal AI models that are making waves in the artificial intelligence community. These models, which are freely available to the public, have demonstrated performance levels that rival or even surpass those of industry giants like OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet [1].
Molmo's capabilities have been put to the test across various benchmarks, and the results are nothing short of impressive. The largest model, Molmo-72B, achieved the highest score on 11 key academic benchmarks and ranked second in user preference, just behind GPT-4o. It scores 96.3 on DocVQA and 85.5 on TextVQA, outperforming both Gemini 1.5 Pro and Claude 3.5 Sonnet on those tasks, and it beats GPT-4o on Ai2's own AI2D diagram benchmark [2]. The breadth of tasks covered makes Molmo's performance particularly noteworthy.
One of the most significant aspects of Molmo is its open-source nature. Unlike proprietary models such as GPT-4o, Molmo's model weights, training data, and code are being released openly under the permissive Apache 2.0 license, allowing researchers and developers to study, modify, and build upon them [3]. This openness not only fosters innovation but also promotes transparency in AI development.
AI2 has released Molmo in various sizes, ranging from 1 billion to 72 billion parameters. This range provides options for different computational requirements and applications. The smaller models, while not as powerful as their larger counterparts, still offer impressive performance and can be run on more modest hardware; the 1-billion-parameter model is small enough to run on a mobile device [1].
The success of Molmo can be attributed to AI2's innovative training approach. Rather than training on the billions of images used by competitors, the team curated a dataset of fewer than one million images, combining speech-based image descriptions from human annotators with 2D pointing annotations. This method has resulted in models that are both efficient and highly capable [2].
Molmo's release could have far-reaching implications for AI accessibility. By providing free, high-performance models, AI2 is potentially democratizing access to advanced AI capabilities. This could lead to increased innovation and application of AI across various sectors, from academia to small businesses [3].
Despite its impressive performance, Molmo does have limitations. The initial release is focused on vision tasks: text-only input is not supported, so users must upload an image to start an interaction. Additionally, the larger models still require significant computational resources to run effectively [3].
AI2 has indicated that it plans to continue developing and improving Molmo. Over the coming months it intends to release additional model weights and checkpoints, training and evaluation code, the datasets used in training, and an expanded version of the technical report. The open-source nature of the project means that the wider AI community can also contribute to its development [2].
As Molmo continues to evolve, it represents a significant step towards more accessible and transparent AI technologies. Its emergence challenges the dominance of proprietary models and could potentially reshape the landscape of artificial intelligence research and application.
Researchers at the Allen Institute for AI have developed Molmo, an open-source multimodal AI model that rivals proprietary models in performance while being significantly smaller and more efficient.
3 Sources
The Allen Institute for AI (Ai2) has unveiled OLMo 2, a family of open-source language models that compete with leading AI models while adhering to open-source principles, potentially reshaping the landscape of accessible AI technology.
3 Sources
OpenAI, the company behind ChatGPT, plans to release its first open-weight language model since GPT-2 in 2019. This strategic shift comes as the AI industry faces increasing pressure from open-source competitors and changing economic realities.
20 Sources
Genmo releases Mochi 1, an open-source text-to-video AI model, offering high-quality video generation capabilities comparable to proprietary models. The launch is accompanied by a $28.4 million Series A funding round.
4 Sources
NVIDIA has released an open-source large language model with 72 billion parameters, positioning it as a potential competitor to OpenAI's GPT-4. This move marks a significant shift in NVIDIA's AI strategy and could reshape the AI landscape.
3 Sources