Curated by THEOUTPOST
On Wed, 25 Sept, 4:06 PM UTC
3 Sources
[1]
The Most Capable Open Source AI Model Yet Could Supercharge AI Agents
A compact and fully open source visual AI model will make it easier for AI to take control of your computer -- hopefully in a good way. The most capable open source AI model with visual abilities yet could see more developers, researchers, and startups develop AI agents that can carry out useful chores on your computer for you.

Released today by the Allen Institute for AI (Ai2), the Multimodal Open Language Model, or Molmo, can interpret images as well as converse through a chat interface. This means it can make sense of a computer screen, potentially helping an AI agent perform tasks such as browsing the web, navigating through file directories, and drafting documents.

"With this release, many more people can deploy a multimodal model," says Ali Farhadi, CEO of Ai2, a research organization based in Seattle, Washington, and a computer scientist at the University of Washington. "It should be an enabler for next-generation apps."

So-called AI agents are being widely touted as the next big thing in AI, with OpenAI, Google, and others racing to develop them. Agents have become a buzzword of late, but the grand vision is for AI to go well beyond chatting to reliably take complex and sophisticated actions on computers when given a command. This capability has yet to materialize at any kind of scale.

Some powerful AI models already have visual abilities, including GPT-4 from OpenAI, Claude from Anthropic, and Gemini from Google DeepMind. These models can be used to power some experimental AI agents, but they are hidden from view and accessible only via a paid application programming interface, or API. Meta has released a family of AI models called Llama under a license that limits their commercial use, but it has yet to provide developers with a multimodal version. Meta is expected to announce several new products, perhaps including new Llama AI models, at its Connect event today.
"Having an open source, multimodal model means that any startup or researcher that has an idea can try to do it," says Ofir Press, a postdoc at Princeton University who works on AI agents. Press says that the fact that Molmo is open source means that developers will be more easily able to fine-tune their agents for specific tasks, such as working with spreadsheets, by providing additional training data. Models like GPT-4 can only be fine-tuned to a limited degree through their APIs, whereas a fully open model can be modified extensively. "When you have an open source model like this then you have many more options," Press says.

Ai2 is releasing several sizes of Molmo today, including a 70-billion-parameter model and a 1-billion-parameter one that is small enough to run on a mobile device. A model's parameter count refers to the number of units it contains for storing and manipulating data and roughly corresponds to its capabilities. Ai2 says Molmo is as capable as considerably larger commercial models despite its relatively small size, because it was carefully trained on high-quality data. The new model is also fully open source in that, unlike Meta's Llama, there are no restrictions on its use. Ai2 is also releasing the training data used to create the model, providing researchers with more details of its workings.

Releasing powerful models is not without risk. Such models can more easily be adapted for nefarious ends; we may someday, for example, see the emergence of malicious AI agents designed to automate the hacking of computer systems.

Farhadi of Ai2 argues that the efficiency and portability of Molmo will allow developers to build more powerful software agents that run natively on smartphones and other portable devices. "The billion parameter model is now performing in the level of or in the league of models that are at least 10 times bigger," he says. Building useful AI agents may depend on more than just more efficient multimodal models, however.
A key challenge is making the models work more reliably. This may well require further breakthroughs in AI's reasoning abilities -- something that OpenAI has sought to tackle with its latest model o1, which demonstrates step-by-step reasoning skills. The next step may well be giving multimodal models such reasoning abilities. For now, the release of Molmo means that AI agents are closer than ever -- and could soon be useful even outside of the giants that rule the world of AI.
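To make the notion of "parameter count" mentioned above concrete: a model's parameters are its learned weights and biases, and their total tallies up layer by layer. The sketch below counts parameters for a tiny fully connected network -- the layer sizes are made up for illustration and have nothing to do with Molmo's actual architecture.

```python
# Count learnable parameters in a tiny fully connected network.
# Layer sizes are illustrative only; they are not Molmo's architecture.
def dense_params(n_in, n_out):
    """A dense layer has n_in * n_out weights plus one bias per output unit."""
    return n_in * n_out + n_out

layers = [(784, 512), (512, 256), (256, 10)]
total = sum(dense_params(i, o) for i, o in layers)
print(f"{total:,}")  # 535,818
```

Real multimodal models stack many such layers (plus attention and embedding weights), which is how counts reach 1 billion, 7 billion, or 72 billion.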
[2]
AI2's Molmo shows open source can meet, and beat, closed multimodal models | TechCrunch
The common wisdom is that companies like Google, OpenAI, and Anthropic, with bottomless cash reserves and hundreds of top-tier researchers, are the only ones that can make state-of-the-art foundation models. But as one among them famously noted, they "have no moat" -- and AI2 showed that today with the release of Molmo, a multimodal AI model that matches their best while also being small, free, and truly open source.

To be clear, Molmo (multimodal open language model) is a visual understanding engine, not a full-service chatbot like ChatGPT. It doesn't have an API, it's not ready for enterprise integration, and it doesn't search the web for you or for its own purposes. You can think of it as the part of those models that sees an image, understands it, and can describe or answer questions about it.

Molmo (coming in 72B, 7B, and 1B-parameter variants), like other multimodal models, is capable of identifying and answering questions about almost any everyday situation or object. How do you work this coffee maker? How many dogs in this picture have their tongues out? Which options on this menu are vegan? What are the variables in this diagram? It's the kind of visual understanding task we've seen demonstrated with varying levels of success and latency for years. What's different is not necessarily Molmo's capabilities (which you can see in the demo below, or test here), but how it achieves them.

Visual understanding is a broad domain, of course, spanning things like counting sheep in a field to guessing a person's emotional state to summarizing a menu. As such it's difficult to describe, let alone test quantitatively, but as AI2 President Ali Farhadi explained at a demo event at the research organization's HQ in Seattle, you can at least show that two models are similar in their capabilities. "One thing that we're showing today is that open is equal to closed," he said, "And small is now equal to big."
(He clarified that he meant ==, meaning equivalency, not identity; a fine distinction some will appreciate.)

One near constant in AI development has been "bigger is better." More training data, more parameters in the resulting model, and more computing power to create and operate them. But at some point you quite literally can't make them any bigger: there isn't enough data to do so, or the compute costs and times get so high it becomes self-defeating. You simply have to make do with what you have, or even better, do more with less.

Farhadi explained that Molmo, though it performs on par with the likes of GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, weighs in at (according to best estimates) about a tenth their size. And a model a tenth of that size again still approaches their level of capability.

"There are a dozen different benchmarks that people evaluate on. I don't like this game, scientifically... but I had to show people a number," he explained. "Our biggest model is a small model, 72B, it's outperforming GPTs and Claudes and Geminis on those benchmarks. Again, take it with a grain of salt; does this mean that this is really better than them or not? I don't know. But at least to us, it means that this is playing the same game."

If you want to try to stump it, feel free to check out the public demo, which works on mobile too. (If you don't want to log in, you can refresh or scroll up and "edit" the original prompt to replace the image.)

The secret is using less, but better quality data. Instead of training on a library of billions of images that can't possibly all be quality controlled, described, or deduplicated, AI2 curated and annotated a set of just 600,000. Obviously that's still a lot, but compared with six billion it's a drop in the bucket -- a fraction of a percent. While this leaves off a bit of long-tail stuff, their selection process and interesting annotation method gives them very high quality descriptions. Interesting how?
Well, they show people an image and tell them to describe it -- out loud. It turns out people talk about stuff differently from how they write about it, and this produces not just accurate but also conversational and useful results. The resulting image descriptions Molmo produces are rich and practical.

That is best demonstrated by its new, and for at least a few days unique, ability to "point" at the relevant parts of the images. When asked to count the dogs in a photo (33), it put a dot on each of their faces. When asked to count the tongues, it put a dot on each tongue. This specificity lets it do all kinds of new zero-shot actions. And importantly, it works on web interfaces as well: without looking at the website's code, the model understands how to navigate a page, submit a form, and so on. (Rabbit recently showed off something similar for its r1, for release next week.)

So why does all this matter? Models come out practically every day. Google just announced some. OpenAI has a demo day coming up. Perplexity is constantly teasing something or another. Meta is hyping up Llama version whatever.

Well, Molmo is completely free and open source, as well as being small enough that it can run locally. No API, no subscription, no water-cooled GPU cluster needed. The intent of creating and releasing the model is to empower developers and creators to make AI-powered apps, services, and experiences without needing to seek permission from (and pay) one of the world's largest tech companies.

"We're targeting researchers, developers, app developers, people who don't know how to deal with these [large] models. A key principle in targeting such a wide range of audiences is one we've been pushing for a while, which is: make it more accessible," Farhadi said. "We're releasing every single thing that we've done. This includes data, cleaning, annotations, training, code, checkpoints, evaluation. We're releasing everything about it that we have developed."
He added that he expects people to start building with this dataset and code immediately -- including deep-pocketed rivals, who hoover up any "publicly available" data, meaning anything not nailed down. ("Whether they mention it or not is a whole different story," he added.) The AI world moves fast, but increasingly the giant players are finding themselves in a race to the bottom, lowering prices to the bare minimum while raising hundreds of millions to cover the cost. If similar capabilities are available from free, open source options, can the value offered by those companies really be so astronomical? At the very least, Molmo shows that, though it's an open question whether the emperor has clothes, he definitely doesn't have a moat.
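The "pointing" behavior described above -- the model answering with coordinates on the image rather than just text -- implies the response carries structured point annotations the application can parse. The sketch below shows one plausible way to extract them; the XML-ish `<point>` format here is an assumption for illustration, not Molmo's documented output format.

```python
import re

# Hypothetical model response embedding point annotations in the text.
# The <point x=".." y=".."> format is an assumed example, not a documented spec.
response = (
    'Two dogs: <point x="12.5" y="40.0">dog</point> '
    '<point x="61.5" y="38.2">dog</point>'
)

# Pull out (x, y, label) triples so a UI can draw a dot per match.
points = [
    (float(m.group(1)), float(m.group(2)), m.group(3))
    for m in re.finditer(
        r'<point x="([\d.]+)" y="([\d.]+)">([^<]*)</point>', response
    )
]
print(len(points))   # 2
print(points[0])     # (12.5, 40.0, 'dog')
```

Counting becomes trivial once the points are extracted -- the count is just the number of annotations, which is also what makes the behavior useful for clicking buttons or filling forms on a web page.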
[3]
A tiny new open-source AI model performs as well as powerful big ones
Meanwhile, Ai2 says a smaller Molmo model, with 7 billion parameters, comes close to OpenAI's state-of-the-art model in performance, an achievement it ascribes to vastly more efficient data collection and training methods.

What Molmo shows is that open-source AI development is now on par with closed, proprietary models, says Ali Farhadi, the CEO of Ai2. And open-source models have a significant advantage, as their open nature means other people can build applications on top of them. The Molmo demo is available here, and it will be available for developers to tinker with on the Hugging Face website. (Certain elements of the most powerful Molmo model are still shielded from view.)

Other large multimodal language models are trained on vast data sets containing billions of images and text samples that have been hoovered from the internet, and they can include several trillion parameters. This process introduces a lot of noise to the training data and, with it, hallucinations, says Ani Kembhavi, a senior director of research at Ai2. In contrast, Ai2's Molmo models have been trained on a significantly smaller and more curated data set containing only 600,000 images, and they have between 1 billion and 72 billion parameters. This focus on high-quality data, versus indiscriminately scraped data, has led to good performance with far fewer resources, Kembhavi says.
Researchers at the Allen Institute for AI have developed Molmo, an open-source multimodal AI model that rivals proprietary models in performance while being significantly smaller and more efficient.
Researchers at the Allen Institute for AI have unveiled Molmo, an open-source multimodal AI model that is making waves in the artificial intelligence community. This compact yet powerful model is challenging the dominance of larger, proprietary AI systems, demonstrating that bigger isn't always better in the world of machine learning [1].
Molmo's most striking feature is its size -- or lack thereof. While flagship models from industry giants like OpenAI and Google can run to hundreds of billions of parameters, Molmo's variants range from just 1 billion to 72 billion. Despite its relatively small size, Molmo has shown comparable performance to these larger models on various benchmarks, proving that efficiency, clever design, and carefully curated training data can trump sheer scale [3].
What sets Molmo apart is its multimodal capability. Unlike many AI models that specialize in either text or image processing, Molmo can handle both with remarkable proficiency. This versatility allows it to perform tasks such as image captioning, visual question answering, and even "pointing" at the relevant parts of an image [2].
As an open-source project, Molmo represents a significant shift in the AI landscape. Its accessibility allows researchers and developers worldwide to study, modify, and improve upon the model. This collaborative approach not only accelerates innovation but also promotes transparency in AI development -- a crucial factor as these technologies become increasingly integrated into our daily lives [1].
Molmo's success challenges the prevailing notion that only large tech companies with vast resources can produce cutting-edge AI models. It demonstrates that smaller, more efficient models can achieve comparable results, potentially democratizing AI development and reducing the environmental impact of training and running these systems [3].
While Molmo's performance is impressive, it's important to note that the AI landscape is rapidly evolving. As open-source models like Molmo continue to improve, they may face challenges from proprietary models that are constantly being updated and refined. However, the collaborative nature of open-source development could lead to rapid advancements and innovations that keep pace with or even surpass closed-source alternatives [2].
© 2025 TheOutpost.AI All rights reserved