Curated by THEOUTPOST
On Fri, 24 Jan, 12:04 AM UTC
5 Sources
[1]
Hugging Face's New SmolVLM 256M Model Can Run on Consumer Laptops
SmolVLM can analyse images and process visual information at high speeds

Hugging Face introduced two new variants of its SmolVLM vision language models last week. The new artificial intelligence (AI) models are available in 256 million and 500 million parameter sizes, with the former claimed by the company to be the world's smallest vision language model. The new variants aim to retain the capability of the older two-billion-parameter model while reducing the size significantly. The company highlighted that the new models can run locally on constrained devices such as consumer laptops, and could potentially even support browser-based inference.

In a blog post, the company announced the SmolVLM-256M and SmolVLM-500M vision language models, joining the existing two-billion-parameter model. The release comprises two base models and two instruction fine-tuned models in the aforementioned parameter sizes. Hugging Face said these models can be loaded directly into transformers, Apple's MLX framework, and the Open Neural Network Exchange (ONNX) format, and that developers can build on top of the base models. Notably, these are open-source models available under an Apache 2.0 licence for both personal and commercial use.

With the new AI models, Hugging Face aims to bring multimodal models focused on computer vision to portable devices. The 256-million-parameter model, for instance, can run on less than 1GB of GPU memory and 15GB of RAM while processing 16 images per second (with a batch size of 64). Andrés Marafioti, a machine learning research engineer at Hugging Face, told VentureBeat: "For a mid-sized company processing 1 million images monthly, this translates to substantial annual savings in compute costs."

To reduce the size of the AI models, the researchers switched the vision encoder from the previous 400-million-parameter SigLIP to a 93-million-parameter SigLIP base patch variant, and also optimised the tokenisation. The new vision models encode images at a rate of 4,096 pixels per token, compared with 1,820 pixels per token in the 2B model. The smaller models trail the 2B model marginally in performance, but the company said this trade-off has been kept to a minimum.

As per Hugging Face, the 256M variant can be used for captioning images or short videos, answering questions about documents, and basic visual reasoning tasks. Developers can use transformers and MLX for inference and fine-tuning, as the new models work with the existing SmolVLM code out of the box. The models are also listed on Hugging Face.
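Since Hugging Face says the new checkpoints work with the existing SmolVLM transformers code out of the box, inference follows the library's standard vision-to-sequence pattern. Below is a minimal sketch under that assumption; the HuggingFaceTB/SmolVLM-256M-Instruct model ID matches the company's published naming, while the image path and prompt are illustrative placeholders.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "HuggingFaceTB/SmolVLM-256M-Instruct"  # instruction-tuned 256M variant

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.to("cuda" if torch.cuda.is_available() else "cpu")

# Build a chat-style prompt containing one image and one question.
image = Image.open("invoice.png")  # illustrative local file
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What is the total amount on this document?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```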
[2]
Can 256M parameters outperform 80B? Hugging Face's SmolVLM models say yes
Hugging Face has released two new AI models, SmolVLM-256M and SmolVLM-500M, claiming they are the smallest of their kind capable of analyzing images, videos, and text on devices with limited RAM, such as laptops. A small language model (SLM) is a neural network designed to produce natural language text; the descriptor "small" refers to the model's parameter count, neural architecture, and the volume of data used during training. SmolVLM-256M and SmolVLM-500M consist of 256 million and 500 million parameters, respectively. These models can perform various tasks, including describing images and video clips and answering questions about PDFs and their contents, such as scanned text and charts.

To train these models, Hugging Face utilized The Cauldron, a curated collection of 50 high-quality image and text datasets, alongside Docmatix, a dataset comprising file scans with detailed captions. Both datasets were created by Hugging Face's M4 team, which focuses on multimodal AI technologies. The team asserts that SmolVLM-256M and SmolVLM-500M outperform a significantly larger model, Idefics 80B, on benchmarks such as AI2D, which assesses models' abilities to analyze grade-school-level science diagrams. The new models are available for web access and download under an Apache 2.0 license, which allows unrestricted use.

Despite their versatility and cost-effectiveness, smaller models like SmolVLM-256M and SmolVLM-500M may exhibit limitations not observed in larger models. A study from Google DeepMind, Microsoft Research, and the Mila research institute found that smaller models often perform suboptimally on complex reasoning tasks, potentially because they tend to recognize surface-level patterns rather than apply knowledge in novel contexts.

Hugging Face's SmolVLM-256M model operates with less than one gigabyte of GPU memory yet outperforms Idefics 80B, a system 300 times larger that was released just 17 months earlier. Andrés Marafioti, a machine learning research engineer at Hugging Face, called this a significant breakthrough in vision-language models. The introduction of these models is timely for enterprises facing high computing costs associated with AI implementations: the SmolVLM models process images and understand visual content at unprecedented speeds for models of their size. The 256M version can process 16 images per second while consuming only 15GB of RAM with a batch size of 64, a considerable cost saving for businesses handling large volumes of visual data.

IBM has formed a partnership with Hugging Face to incorporate the 256M model into its document processing software, Docling. As Marafioti explained, even organizations with substantial computing resources can benefit from smaller models that process millions of documents at reduced cost. Hugging Face achieved the size reduction while maintaining performance through advancements in both the vision processing and language components, including a switch from a 400M-parameter vision encoder to a 93M-parameter version and more aggressive token compression. This efficiency opens new possibilities for startups and smaller enterprises, enabling them to develop sophisticated computer vision products more rapidly at lower infrastructure cost.
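Marafioti's savings claim can be sanity-checked with simple arithmetic from the throughput figure quoted above; the one-million-images workload is the article's own hypothetical, and a single GPU is assumed.

```python
# Back-of-envelope: monthly single-GPU time for 1 million images at the
# quoted SmolVLM-256M throughput of 16 images per second (batch size 64).
IMAGES_PER_MONTH = 1_000_000
IMAGES_PER_SECOND = 16

gpu_hours = IMAGES_PER_MONTH / IMAGES_PER_SECOND / 3600
print(f"~{gpu_hours:.1f} GPU-hours per month")  # ~17.4 GPU-hours
```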
The SmolVLM models enable more than cost savings, facilitating new applications such as advanced document search through ColPali, an algorithm that creates searchable databases from document archives. According to Marafioti, these models come very close to the performance of models 10 times their size while significantly increasing the speed at which databases are created and searched, making enterprise-wide visual search feasible for a wide range of businesses. The SmolVLM models challenge the conventional belief that larger models are necessary for advanced vision-language tasks, with the 500M parameter version achieving 90 percent of the performance of its 2.2B parameter counterpart on key benchmarks. Marafioti argued that this demonstrates the usefulness of smaller models and suggests they can play a crucial role for businesses.
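ColPali-style retrieval works by embedding each document page's image patches and scoring queries with ColBERT-style "late interaction": every query token is matched against its best page patch and the maxima are summed. The sketch below shows only that scoring step, with made-up tensor shapes; it is not the ColPali library's API.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance between one query and one page.

    query_emb: (num_query_tokens, dim) L2-normalized text-token embeddings
    page_emb:  (num_patches, dim)      L2-normalized page-patch embeddings
    """
    sim = query_emb @ page_emb.T        # cosine similarity per (token, patch)
    return sim.max(dim=1).values.sum()  # best patch for each token, summed

# Illustrative shapes: 20 query tokens, 1024 page patches, 128-dim embeddings.
q = F.normalize(torch.randn(20, 128), dim=-1)
p = F.normalize(torch.randn(1024, 128), dim=-1)
print(maxsim_score(q, p))
```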
[3]
Hugging Face claims its new AI models are the smallest of their kind | TechCrunch
A team at AI dev platform Hugging Face has released what they're claiming are the smallest AI models that can analyze images, short videos, and text. The models, SmolVLM-256M and SmolVLM-500M, are designed to work well on "constrained devices" like laptops, running in under around 1GB of GPU memory. The team says they're also ideal for developers trying to process large amounts of data very cheaply.

SmolVLM-256M and SmolVLM-500M are just 256 million parameters and 500 million parameters in size, respectively. (Parameter count roughly corresponds to a model's problem-solving ability, such as its performance on math tests.) Both models can perform tasks like describing images or video clips and answering questions about PDFs and the elements within them, including scanned text and charts.

To train SmolVLM-256M and SmolVLM-500M, the Hugging Face team used The Cauldron, a collection of 50 "high-quality" image and text datasets, and Docmatix, a set of file scans paired with detailed captions. Both were created by Hugging Face's M4 team, which develops multimodal AI technologies.

The team claims that both SmolVLM-256M and SmolVLM-500M outperform a much larger model, Idefics 80B, on benchmarks including AI2D, which tests the ability of models to analyze grade-school-level science diagrams. SmolVLM-256M and SmolVLM-500M are available on the web as well as for download from Hugging Face under an Apache 2.0 license, meaning they can be used without restrictions.

Small models like SmolVLM-256M and SmolVLM-500M may be inexpensive and versatile, but they can also contain flaws that aren't as pronounced in larger models. A recent study from Google DeepMind, Microsoft Research, and the Mila research institute in Quebec found that many small models perform worse than expected on complex reasoning tasks. The researchers speculated that this could be because smaller models recognize surface-level patterns in data but struggle to apply that knowledge in new contexts.
[4]
Hugging Face shrinks AI vision models to phone-friendly size, slashing computing costs
Hugging Face has achieved a remarkable breakthrough in artificial intelligence by introducing vision-language models that run on devices as small as smartphones while outperforming predecessors that required massive data centers. The company's new SmolVLM-256M model, requiring less than one gigabyte of GPU memory, surpasses the performance of its Idefics 80B model from just 17 months ago, a system 300 times larger. This dramatic reduction in size alongside improved capability marks a watershed moment for practical AI deployment.

"When we released Idefics 80B in August 2023, we were the first company to open-source a vision-language model," said Andrés Marafioti, machine learning research engineer at Hugging Face, in an exclusive interview with VentureBeat. "By achieving a 300x size reduction while improving performance, SmolVLM marks a breakthrough in vision-language models."

Smaller AI Models That Run on Everyday Devices

The advancement arrives at a crucial moment for enterprises struggling with the astronomical computing costs of implementing AI systems. The new SmolVLM models, available in 256M and 500M parameter sizes, process images and understand visual content at speeds previously unattainable in their size class. The smallest version processes 16 examples per second while using only 15GB of RAM with a batch size of 64, making it particularly attractive for businesses looking to process large volumes of visual data.

"For a mid-sized company processing 1 million images monthly, this translates to substantial annual savings in compute costs," Marafioti told VentureBeat. "The reduced memory footprint means businesses can deploy on cheaper cloud instances, cutting infrastructure costs."

The development has already caught the attention of major technology players. IBM has partnered with Hugging Face to integrate the 256M model into Docling, its document processing software. "While IBM certainly has access to substantial compute resources, using smaller models like these allows them to efficiently process millions of documents at a fraction of the cost," said Marafioti.

How Hugging Face reduced model size without compromising power

The efficiency gains come from technical innovations in both the vision processing and language components. The team switched from a 400M parameter vision encoder to a 93M parameter version and implemented more aggressive token compression techniques. These changes maintain high performance while dramatically reducing computational requirements.

For startups and smaller enterprises, these developments could be transformative. "Startups can now launch sophisticated computer vision products in weeks instead of months, with infrastructure costs that were prohibitive mere months ago," Marafioti said.

The impact extends beyond cost savings to enabling entirely new applications. The models are powering advanced document search capabilities through ColPali, an algorithm that creates searchable databases from document archives. "They obtain very close performances to those of models 10x the size while significantly increasing the speed at which the database is created and searched, making enterprise-wide visual search accessible to businesses of all types for the first time," according to Marafioti.
Why smaller AI models are the future of AI development

The breakthrough challenges conventional wisdom about the relationship between model size and capability. While many researchers have assumed that larger models were necessary for advanced vision-language tasks, SmolVLM demonstrates that smaller, more efficient architectures can achieve similar results. The 500M parameter version achieves 90% of the performance of its 2.2B parameter sibling on key benchmarks.

Rather than suggesting an efficiency plateau, Marafioti sees these results as evidence of untapped potential: "Until today, the standard was to release VLMs starting at 2B parameters; we thought that smaller models were not useful. We are proving that, in fact, models at 1/10 of the size can be extremely useful for businesses."

This development arrives amid growing concerns about AI's environmental impact and computing costs. By dramatically reducing the resources required for vision-language AI, Hugging Face's innovation could help address both issues while making advanced AI capabilities accessible to a broader range of organizations.

The models are available open-source, continuing Hugging Face's tradition of increasing access to AI technology. This accessibility, combined with the models' efficiency, could accelerate the adoption of vision-language AI across industries from healthcare to retail, where processing costs have previously been prohibitive.

In a field where bigger has long meant better, Hugging Face's achievement suggests a new paradigm: the future of AI might not be found in ever-larger models running in distant data centers, but in nimble, efficient systems running right on our devices. As the industry grapples with questions of scale and sustainability, these smaller models might just represent the biggest breakthrough yet.
[5]
Hugging Face open-sources world's smallest vision language model - SiliconANGLE
Hugging Face Inc. today open-sourced SmolVLM-256M, a new vision language model with the lowest parameter count in its category. The algorithm's small footprint allows it to run on devices such as consumer laptops that have relatively limited processing power. According to Hugging Face, it could potentially run in browsers as well. The latter capability is facilitated by the model's support for WebGPU, a technology that allows AI-powered web applications to use the graphics card in the user's computer.

SmolVLM-256M lends itself to a range of tasks that involve processing visual data. It can answer questions about scanned documents, describe videos and explain charts. Hugging Face has also developed a version of the model that can customize its output based on user prompts.

Under the hood, SmolVLM-256M features 256 million parameters, a fraction of the hundreds of billions of parameters in the most advanced foundation models. The lower a model's parameter count, the less hardware it uses, which is why SmolVLM-256M can run on devices such as laptops.

The algorithm is the latest in a series of open-source vision language models released by Hugging Face. Compared with the company's earlier models, one of the main improvements in SmolVLM-256M is a new encoder, the software module tasked with turning the files an AI processes into encodings, mathematical structures that neural networks can work with more easily. SmolVLM-256M's encoder is based on an open-source AI called SigLIP base patch-16/512, which is in turn derived from an image processing model that OpenAI released in 2021. The encoder includes 93 million parameters, less than a quarter of the parameters in Hugging Face's previous-generation encoder, which helped the company reduce SmolVLM-256M's hardware footprint.

"As a bonus, the smaller encoder processes images at a larger resolution, which (per research from Apple and Google) can often yield better visual understanding without ballooning parameter counts," Hugging Face engineers Andres Marafioti, Miquel Farré and Merve Noyan wrote in a blog post.

The company trained the AI on an improved version of a dataset it used to develop its previous-generation vision language models. To boost SmolVLM-256M's reasoning skills, Hugging Face expanded the dataset with a collection of handwritten mathematical expressions. The company also made other additions designed to hone the model's document understanding and image captioning skills.

In an internal evaluation, Hugging Face compared SmolVLM-256M against a multimodal model with 80 billion parameters that it released 18 months ago. The smaller model achieved higher scores across more than half a dozen benchmarks. In a benchmark called MathVista, which includes geometry problems, SmolVLM-256M's score was more than 10% higher.

Hugging Face is rolling out the model alongside a second, more capable algorithm called SmolVLM-500M that features 500 million parameters. It trades some hardware efficiency for higher output quality and, according to Hugging Face, is also better at following user instructions. "If you need more performance headroom while still keeping the memory usage low, SmolVLM-500M is our half-billion-parameter compromise," the company's engineers wrote.
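The engineers' "larger resolution" remark can be made concrete with the pixels-per-token rates Hugging Face published (4,096 for the new models versus 1,820 for the 2B model). Reading 512x512 off the "patch-16/512" name and treating the image as a single pass are simplifying assumptions here:

```python
# Visual tokens implied by the published pixels-per-token rates for one
# 512x512 input (the encoder resolution in "SigLIP base patch-16/512").
RESOLUTION = 512
for model, px_per_token in [("SmolVLM-256M/500M", 4096), ("SmolVLM 2B", 1820)]:
    tokens = RESOLUTION * RESOLUTION / px_per_token
    print(f"{model}: ~{tokens:.0f} visual tokens per image")
# SmolVLM-256M/500M: ~64 tokens, SmolVLM 2B: ~144 tokens
```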
Hugging Face introduces SmolVLM-256M and SmolVLM-500M, the world's smallest vision-language AI models capable of running on consumer devices while outperforming larger counterparts, potentially transforming AI accessibility and efficiency.
Hugging Face, a leading AI development platform, has unveiled two new vision-language models that are set to revolutionize the field of artificial intelligence. The SmolVLM-256M and SmolVLM-500M models, with 256 million and 500 million parameters respectively, are being hailed as the world's smallest of their kind capable of analyzing images, videos, and text on devices with limited computational resources [1][2].
These new models represent a significant breakthrough in AI efficiency. The SmolVLM-256M model can operate with less than one gigabyte of GPU memory and 15GB of RAM, processing 16 images per second with a batch size of 64 [1][3]. This level of performance is particularly impressive considering that it outperforms the Idefics 80B model, which is 300 times larger and was released just 17 months prior [4].
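The sub-gigabyte figure is consistent with simple parameter arithmetic. The sketch below assumes 16-bit (2-byte) weights; real usage adds activations and KV cache, so treat it as a floor, not a measurement.

```python
# Approximate weight memory for each model, assuming 2 bytes per parameter.
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    return num_params * bytes_per_param / 1024**3

for name, params in [("SmolVLM-256M", 256e6), ("SmolVLM-500M", 500e6),
                     ("Idefics 80B", 80e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.2f} GB of weights")
# SmolVLM-256M: ~0.48 GB, SmolVLM-500M: ~0.93 GB, Idefics 80B: ~149 GB
```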
Despite their compact size, the SmolVLM models demonstrate remarkable versatility. They can perform various tasks, including:
- Captioning and describing images and short video clips
- Answering questions about PDFs and other documents, including scanned text and charts
- Basic visual reasoning over diagrams and other visual content
This broad functionality makes them suitable for a wide range of applications across different industries.
The introduction of these models comes at a crucial time for enterprises grappling with the high computing costs associated with AI implementations. Andrés Marafioti, a machine learning research engineer at Hugging Face, highlighted the potential cost savings: "For a mid-sized company processing 1 million images monthly, this translates to substantial annual savings in compute costs" [3][4].
The efficiency gains in the SmolVLM models stem from several technical advancements:
- A switch from the previous 400M-parameter SigLIP vision encoder to a 93M-parameter SigLIP base patch variant
- More aggressive token compression, with images encoded at 4,096 pixels per token compared to 1,820 in the 2B model
- Optimized tokenization and improved training datasets, including The Cauldron and Docmatix
The potential of these models has already attracted attention from major tech players. IBM has partnered with Hugging Face to integrate the 256M model into Docling, its document processing software [4]. This collaboration demonstrates the models' potential to enhance efficiency in large-scale document processing tasks.
The success of the SmolVLM models challenges the prevailing notion that larger models are necessary for advanced vision-language tasks. The 500M parameter version achieves 90% of the performance of its 2.2B parameter counterpart on key benchmarks [4]. This development suggests a new paradigm in AI development, focusing on efficiency and accessibility rather than sheer size.
In line with Hugging Face's commitment to open-source AI, both SmolVLM models are available under an Apache 2.0 license. This allows unrestricted use for both personal and commercial purposes, potentially accelerating the adoption of vision-language AI across various industries [1][5].
The introduction of these compact yet powerful models could have far-reaching implications for the AI industry. By dramatically reducing the resources required for vision-language AI, Hugging Face's innovation addresses concerns about AI's environmental impact and computing costs. It also opens up possibilities for AI applications on edge devices and in resource-constrained environments [4][5].
As the industry continues to evolve, the SmolVLM models represent a significant step towards more efficient, accessible, and sustainable AI technologies. Their development suggests that the future of AI might lie not in ever-larger models, but in smarter, more compact solutions that can run on everyday devices.