Curated by THEOUTPOST
On Fri, 25 Oct, 12:03 AM UTC
4 Sources
[1]
Meta releases compact versions of Llama 3.2 AI models
The new quantised models are 56pc smaller and use 41pc less memory when compared to the full-size models released last month.

Meta has released compact versions of its lightweight Llama 3.2 1B and 3B models that are small enough to run effectively on mobile devices. The Facebook owner, in an announcement yesterday (24 October), said that these new "quantised" models are 56pc smaller and use 41pc less memory than the original 3.2 models released last month.

Meta says the 1B and 3B models can be used for on-device applications such as summarising a discussion from your phone or calling on-device tools such as a calendar. The new models "apply the same quality and safety requirements" as the original Llama 3.2 1B and 3B, while processing information two to four times faster, the company claimed.

The new versions were quantised using two techniques: one that prioritises accuracy in low-precision environments and one that prioritises portability while aiming to retain quality. "These models offer a reduced memory footprint, faster on-device inference, accuracy and portability - all while maintaining quality and safety for developers to deploy on resource-constrained devices," the announcement read.

Users can download and deploy these new model versions onto mobile CPUs that the company has optimised in "close collaboration" with other industry leaders in this space, it said.

These lightweight models are part of the text-only series of 1B and 3B models, which are available in the EU. The multimodal models, 11B and 90B, which can process both text and images, are not available in the EU. In the summer, Meta said that it would not release these models in the EU because of the bloc's "unpredictable" regulatory environment.

The month before that announcement, the company rolled back plans to train its large language models using public content shared by adults on Facebook and Instagram, following intensive discussion with the Irish Data Protection Commission. Privacy advocacy group Noyb expressed serious concerns about the plan, alleging that Meta's intention to use AI training material sourced from public and licenced data that could include personal information would breach the GDPR.
[2]
Meta Releases Quantized Llama 3.2 with 4x Inference Speed on Android Phones
These models reduce memory usage by an average of 41% and decrease model size by 56% compared to the initial BF16 format.

Meta has introduced quantized versions of its Llama 3.2 models, enhancing on-device AI performance with up to four times faster inference speeds, a 56% model size reduction, and a 41% decrease in memory usage. The models, designed to operate effectively on mobile devices, can now be accessed through Meta and Hugging Face, expanding deployment possibilities across mobile CPUs in collaboration with Arm, MediaTek, and Qualcomm.

These quantized Llama models in the 1B and 3B categories are designed to match the quality and safety standards of their original versions while offering significant improvements in performance, achieving speeds 2-4 times faster. Additionally, they reduce memory usage by an average of 41% and decrease model size by 56% compared to the initial BF16 format.

The Llama Stack reference implementation, through PyTorch's ExecuTorch framework, supports inference for both quantization techniques. Developed in partnership with industry leaders, these optimised models are now available for Qualcomm and MediaTek SoCs with Arm CPUs. Meta is also exploring additional performance gains through NPU support, collaborating with partners to integrate NPU functionalities within the ExecuTorch open-source ecosystem. These efforts aim to optimise the Llama 1B and 3B quantized models specifically for NPU utilisation, enhancing the models' efficiency across a broader range of devices.

Previously, quantized models often sacrificed accuracy and performance, but Meta's use of Quantization-Aware Training (QAT) and LoRA adaptors ensures that the new models maintain quality and safety standards. Meta also used SpinQuant, a state-of-the-art quantization method, to prioritise model portability, reducing size while preserving functionality. This approach allows for substantial compression without compromising inference quality, and testing on devices like the OnePlus 12 confirms the models' efficiency. Performance testing has shown similar efficiencies on Samsung devices (S24+ for 1B and 3B, S22 for 1B) and comparable accuracy on iOS devices, though further performance evaluations are ongoing.

Quantization-Aware Training was applied to the Llama 3.2 models by simulating quantization during the training process to optimise performance in low-precision environments. This process involved refining BF16 Llama 3.2 model checkpoints using QAT and an additional supervised fine-tuning (SFT) round with LoRA adaptors. These adaptors keep their weights and activations in BF16, resembling Meta's QLoRA technique, which combines quantization and LoRA for greater model efficiency.

Looking forward, Meta's release of these quantized models should help bring on-device AI inference to more of its products, including the Ray-Ban Meta smart glasses. In September, Meta launched Llama 3.2, which it said outperformed closed-source models, including GPT-4o, on several benchmarks. Meta recently also unveiled Meta Spirit LM, an open-source multimodal language model focused on the seamless integration of speech and text.
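The core mechanism described above, simulating quantization during training, can be illustrated with a short PyTorch sketch: weights are rounded to low-precision levels in the forward pass, while gradients flow through unrounded via a straight-through estimator. This is a toy illustration of the general QAT idea, not Meta's actual recipe; the class names are invented for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuant(torch.autograd.Function):
    """Round weights to int8 levels in the forward pass; pass gradients
    straight through in the backward pass (straight-through estimator)."""
    @staticmethod
    def forward(ctx, w, scale):
        return torch.clamp(torch.round(w / scale), -128, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None  # gradient w.r.t. w only; scale is not trained here

class QATLinear(nn.Linear):
    """A linear layer whose weights see quantization noise during training."""
    def forward(self, x):
        scale = self.weight.detach().abs().max() / 127.0
        w_q = FakeQuant.apply(self.weight, scale)
        return F.linear(x, w_q, self.bias)

# The layer trains normally, but the loss already reflects quantized weights.
layer = QATLinear(16, 4)
out = layer(torch.randn(2, 16))
out.sum().backward()  # gradients reach layer.weight via the STE
```

Because the loss is computed against already-rounded weights, the optimizer learns parameters that remain accurate after quantization, which is why QAT typically beats quantizing a finished model after the fact.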
[3]
Meta debuts slimmed-down Llama models for low-powered devices
Meta Platforms Inc. is striving to make its popular open-source large language models more accessible with the release of "quantized" versions of the Llama 3.2 1B and Llama 3.2 3B models, designed to run on low-powered devices.

The Llama 3.2 1B and 3B models were announced at Meta's Connect 2024 event last month. They're the company's smallest LLMs so far, designed to address the demand to run generative artificial intelligence on-device and in edge deployments. Now it's releasing quantized, or lightweight, versions of those models, which come with a reduced memory footprint and support faster on-device inference with greater accuracy, the company said. It's all in the pursuit of portability, Meta said, enabling the Llama 3.2 1B and 3B models to be deployed on resource-constrained devices while maintaining their strong performance.

In a blog post today, Meta's AI research team explained that, because of the limited runtime memory available on mobile devices, it opted to prioritize "short-context applications up to 8K" for the quantized models.

Quantization is a technique that reduces the size of large language models by lowering the precision of their model weights. Meta's researchers said they used two different methods to quantize the Llama 3.2 1B and 3B models, including a technique known as "Quantization-Aware Training with LoRA adaptors," or QLoRA, which helps to optimize their performance in low-precision environments.

The QLoRA method prioritizes accuracy when quantizing LLMs, but in cases where developers would rather put more emphasis on portability at the expense of some performance, a second technique, known as SpinQuant, can be used. Using SpinQuant, Meta said it can determine the best possible combination for compression, ensuring the model can be ported to the target device while retaining the best possible performance. Meta noted that inference using both of the quantization techniques is supported in the Llama Stack reference implementation via PyTorch's ExecuTorch framework.

In its tests, Meta demonstrated that the quantized Llama 3.2 1B and 3B models enable an average reduction in model size of 56% compared to the original formats, resulting in a two- to four-times speedup in inference processing. The company said tests with Android OnePlus 12 smartphones showed that the models reduced memory usage by an average of 41%, while almost matching the performance of the full-sized versions.

Meta developed the quantized Llama 3.2 1B and 3B models in collaboration with Qualcomm Inc. and MediaTek Inc. to ensure that they're optimized to run on those companies' Arm-based system-on-chip hardware. It added that it used Kleidi AI kernels to optimize the models for mobile central processing units. By enabling the Llama models to run on mobile CPUs, developers will be able to create more unique AI experiences with greater privacy, with all interactions taking place on the device. The quantized Llama 3.2 1B and 3B models can be downloaded from Llama.com and Hugging Face starting today.

Meta's AI research efforts have been in overdrive this month. The quantized Llama models are the company's fourth major announcement in just the last three weeks. At the start of the month, the company unveiled a family of Meta Movie Gen models that can be used to create and edit video footage with text-based prompts.
A few days later, it announced a host of new generative AI advertising features for marketers, and late last week it debuted an entirely new model called Spirit LM, for creating expressive AI-generated voices that reflect happiness, sadness, anger, surprise and other emotions.
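To make the quantization arithmetic in the article concrete, the sketch below quantizes a single BF16 weight matrix to int8 with simple symmetric per-row scales and prints the resulting size reduction. This is a toy illustration only; the released models use lower-bit schemes with group-wise scales, which is how they reach an average 56% reduction across the whole model.

```python
import torch

# One BF16 weight matrix, roughly the shape found in a small transformer layer
w = torch.randn(4096, 4096).to(torch.bfloat16)

# Symmetric per-row int8 quantization: one FP32 scale per output row
w_f = w.float()
scale = w_f.abs().amax(dim=1, keepdim=True) / 127.0
w_int8 = torch.clamp(torch.round(w_f / scale), -128, 127).to(torch.int8)

bf16_bytes = w.numel() * 2                       # 2 bytes per BF16 value
int8_bytes = w_int8.numel() + scale.numel() * 4  # 1 byte per weight + FP32 scales
print(f"size reduction: {1 - int8_bytes / bf16_bytes:.1%}")  # about 50% for int8

# Dequantize to check the approximation error quantization introduces
err = (w_f - w_int8.float() * scale).abs().max()
print(f"max abs error: {err:.4f}")
```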
[4]
Introducing quantized Llama models with increased speed and a reduced memory footprint
At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B, our smallest models yet, to address the demand for on-device and edge deployments. Since their release, we've seen not just how the community has adopted our lightweight models, but also how grassroots developers are quantizing them to save capacity and memory footprint, often at a tradeoff to performance and accuracy. As we've shared before, we want to make it easier for more developers to build with Llama, without needing significant compute resources and expertise.

Today, we're sharing quantized versions of the Llama 3.2 1B and 3B models. These models offer a reduced memory footprint, faster on-device inference, accuracy, and portability, all while maintaining quality and safety for developers to deploy on resource-constrained devices. Given the limited runtime memory available on mobile devices, we prioritized short-context applications up to 8K for these new quantized models.

Our results show we can achieve superior accuracy by training with quantization as opposed to post-processing. The models we are sharing today have a 2-4x speedup and an average reduction of 56% in model size compared to the original format, based on testing of the 1B model with Android OnePlus 12 devices; similar improvements were observed for the 3B model. We also reduce memory usage by an average of 41%. Starting today, the community can deploy our quantized models onto more mobile CPUs, giving them the opportunity to build unique experiences that are fast and provide more privacy, since interactions stay entirely on device.

We developed these state-of-the-art models using Quantization-Aware Training with LoRA adaptors (QLoRA) to optimize performance in low-precision environments. We also used SpinQuant, a technique that enables us to determine the best possible combination for compression while retaining the most possible quality. As a result of close collaborative work with our industry-leading partners, the QLoRA and SpinQuant Llama models are available on Qualcomm and MediaTek SoCs with Arm CPUs. The performance of the quantized models has been optimized for mobile CPUs using Kleidi AI kernels, and we're currently collaborating with our partners to utilize NPUs for even greater performance for Llama 1B/3B.
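One way to see why SpinQuant helps: rotating a weight matrix by an orthogonal matrix spreads outlier values across many coordinates, and because the rotation can be undone exactly, quantizing the rotated matrix loses less information. The toy sketch below demonstrates the effect with a random rotation; SpinQuant itself learns its rotation matrices, so this illustrates the principle rather than the method.

```python
import torch

def random_orthogonal(n: int, seed: int = 0) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(n, n, generator=g))
    return q

def quantize_int4(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor 4-bit quantization, returned in dequantized form."""
    scale = w.abs().max() / 7.0
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

w = torch.randn(256, 256)
w[:, 0] *= 50.0  # inject an outlier column, the enemy of per-tensor scales

plain_err = (w - quantize_int4(w)).norm()

# Rotate, quantize, rotate back: Q is orthogonal, so (w @ Q) @ Q.T == w exactly,
# and only the quantization step loses information.
Q = random_orthogonal(256)
rotated_err = (w - quantize_int4(w @ Q) @ Q.T).norm()

print(f"int4 error without rotation: {plain_err:.1f}")
print(f"int4 error with rotation:    {rotated_err:.1f}")  # much smaller
```

The outlier column forces a large per-tensor scale in the plain case, wasting most of the 16 int4 levels; after rotation the outlier energy is spread thinly across all columns, so the same 4 bits cover the values far more precisely.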
Meta has released compact versions of its Llama 3.2 1B and 3B AI models, optimized for mobile devices with reduced size and memory usage while maintaining performance.
Meta has unveiled quantized versions of its Llama 3.2 1B and 3B AI models, marking a significant advancement in on-device artificial intelligence capabilities. These compact models, designed to run efficiently on mobile devices, offer improved performance while maintaining the quality and safety standards of their original counterparts [1][2].

The new quantized models boast impressive enhancements:

- Two to four times faster inference speeds
- A 56% average reduction in model size
- A 41% average reduction in memory usage, all relative to the original BF16 format

These improvements enable the models to operate effectively on resource-constrained devices, such as smartphones [3][4].
Meta employed two primary quantization techniques to achieve these results:

1. Quantization-Aware Training (QAT) with LoRA adaptors: This method optimizes performance in low-precision environments while prioritizing accuracy [2][4] (see the sketch after this list).
2. SpinQuant: A technique that focuses on model portability, allowing for substantial compression without compromising inference quality [2][4].
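As flagged in the first item, the LoRA adaptor idea can be summarized in a few lines of PyTorch: the base weights stay frozen while a small low-rank update, kept in higher precision (BF16 in Meta's recipe), is trained. The following is a minimal sketch under that assumption; the class name and rank are invented for illustration, and a real QLoRA setup would wrap a quantized base layer rather than a plain nn.Linear.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer (quantized in practice) plus a trainable
    low-rank update kept in higher precision, as in LoRA."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the backbone
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(64, 64))
# Only the small A and B adaptor matrices receive gradients during fine-tuning.
layer(torch.randn(4, 64)).sum().backward()
```

Because only the adaptor matrices train, the fine-tuning pass that recovers quality after quantization touches a tiny fraction of the parameters, which keeps the process cheap.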
The development of these quantized models involved close collaboration with industry leaders:

- Qualcomm and MediaTek, whose Arm-based SoCs the models are optimized for
- Arm, whose Kleidi AI kernels were used to tune performance on mobile CPUs

This collaborative effort ensures that the models are well-suited for a wide range of mobile devices and can leverage specific hardware capabilities for optimal performance [3][4].
The quantized Llama 3.2 models open up new possibilities for on-device AI applications, including:

- Summarizing a discussion directly on a phone
- Calling on-device tools such as a calendar
- Building private assistants whose interactions stay entirely on the device

Meta is exploring additional performance gains through Neural Processing Unit (NPU) support, working with partners to integrate NPU functionalities within the ExecuTorch open-source ecosystem. This effort aims to further optimize the quantized models for a broader range of devices [2][4].
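To give a sense of what deployment through ExecuTorch involves, the framework's documented flow exports a PyTorch module to a .pte program that the on-device runtime loads. The sketch below follows that general export pipeline with a toy model; exact APIs vary across ExecuTorch versions, so treat it as illustrative rather than a drop-in recipe.

```python
import torch
import torch.nn as nn
from executorch.exir import to_edge

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(128, 32)

    def forward(self, x):
        return self.linear(x)

model = TinyModel().eval()
example_inputs = (torch.randn(1, 128),)

# Capture the graph, lower it to ExecuTorch's edge dialect, then serialize
# to a .pte program that the mobile runtime can load and execute.
exported = torch.export.export(model, example_inputs)
et_program = to_edge(exported).to_executorch()

with open("tiny_model.pte", "wb") as f:
    f.write(et_program.buffer)
```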
The quantized Llama 3.2 1B and 3B models are now available for download from Llama.com and Hugging Face. This release allows developers to create unique AI experiences with enhanced privacy, as all interactions can take place directly on the user's device [3][4].
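As a practical aside, the checkpoints on Hugging Face can be fetched with the standard huggingface_hub client. In the sketch below, the repo id follows the naming pattern used for these quantized releases but should be verified on the model page, and access requires accepting Meta's license and supplying an auth token.

```python
from huggingface_hub import snapshot_download

# Repo id follows the release's naming pattern -- confirm on huggingface.co.
# Access is gated: accept Meta's Llama license and use your own HF token.
local_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8",
    token="hf_...",  # replace with your token
)
print("Model files downloaded to:", local_dir)
```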
The release of these optimized models represents a significant step towards making advanced AI capabilities more accessible on everyday devices. By reducing the computational and memory requirements, Meta is enabling a wider range of applications and use cases for on-device AI, potentially accelerating innovation in mobile AI technologies [1][2][3][4].
Reference

[1] Silicon Republic | Meta releases compact versions of Llama 3.2 AI models
[2] Analytics India Magazine | Meta Releases Quantized Llama 3.2 with 4x Inference Speed on Android Phones
[3] SiliconANGLE | Meta debuts slimmed-down Llama models for low-powered devices
[4] Meta AI | Introducing quantized Llama models with increased speed and a reduced memory footprint