Curated by THEOUTPOST
On Thu, 27 Feb, 8:05 AM UTC
5 Sources
[1]
Microsoft's Phi-4-multimodal AI model handles speech, text, and video
The new small language model can help developers build multimodal AI applications for lightweight computing devices, Microsoft says.

Microsoft has introduced a new AI model that, it says, can process speech, vision, and text locally on-device using less compute capacity than previous models.

Innovation in generative artificial intelligence isn't all about large language models (LLMs) running in big data centers: There's also a lot of work going on around small language models (SLMs) that can run on more resource-constrained devices such as mobile phones, laptops, and other edge computing devices. Microsoft's contribution is a suite of small models called Phi, of which it introduced the fourth generation in December.
[2]
Microsoft expands Phi line with new multimodal models
Microsoft Corp. has expanded its Phi line of open-source language models with the introduction of two new algorithms designed for multimodal processing and hardware efficiency: Phi-4-mini and Phi-4-multimodal.

Phi-4-mini is a text-only model with 3.8 billion parameters, enabling it to run efficiently on mobile devices. It is based on a decoder-only transformer architecture, which analyzes only the text preceding a word to determine its meaning, thus speeding up processing and reducing hardware requirements. Phi-4-mini also uses a performance optimization technique known as grouped query attention (GQA) to decrease the hardware usage associated with its attention mechanism.

The model can generate text, translate documents, and execute actions within external applications. Microsoft claims Phi-4-mini excels at tasks requiring complex reasoning, such as mathematical computations and coding challenges, achieving significantly better accuracy in internal benchmark tests than other similarly sized language models.

The second model, Phi-4-multimodal, is an enhanced version of Phi-4-mini with 5.6 billion parameters, capable of processing text, image, audio, and video inputs. It was trained using a new technique called Mixture of LoRAs, which adds multimodal capabilities without extensive modifications to the model's existing weights.

In Microsoft's benchmark tests, Phi-4-multimodal earned an average score of 72 in visual data processing, just shy of OpenAI's GPT-4, which scored 73; Google's Gemini 2.0 Flash led with a score of 74.3. In combined visual and audio tasks, Phi-4-multimodal outperformed Gemini 2.0 Flash "by a large margin" and surpassed InternOmni, which is specialized for multimodal processing.
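The decoder-only behavior described above can be illustrated with a causal attention mask. The NumPy sketch below is an illustrative toy, not Microsoft's implementation: it shows how masking out future positions ensures each token attends only to itself and earlier tokens.

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal (lower-triangular) mask,
    so each position attends only to itself and earlier positions."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # (seq_len, seq_len)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)          # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
out, w = causal_attention(q, k, v)
# Row i of the weight matrix is zero beyond column i: no token "sees" the future.
print(np.allclose(np.triu(w, k=1), 0))  # True
```

Because the mask zeroes out everything above the diagonal, generation can proceed left to right with cached past keys and values, which is part of what makes decoder-only models fast.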
Both Phi-4-multimodal and Phi-4-mini are licensed under the MIT license and are available through Hugging Face, allowing commercial use. Developers can also access the models through Azure AI Foundry and the NVIDIA API Catalog.

Phi-4-multimodal is designed to facilitate natural and context-aware interactions by integrating multiple input types into a single processing model. It includes enhancements such as a larger vocabulary, multilingual capabilities, and improved computational efficiency for on-device execution. Phi-4-mini delivers strong performance in text-based tasks, including reasoning and function calling, which lets it interact with structured programming interfaces; it supports sequences of up to 128,000 tokens.

Both models have undergone extensive security and safety testing led by Microsoft's internal Azure AI Red Team (AIRT), which assessed them using evaluation methodologies that address current trends in cybersecurity, fairness, and user safety.

Customization and ease of deployment are further advantages: the models' smaller sizes allow them to be fine-tuned for specific tasks, such as speech translation and medical question answering, with relatively low computational demands. For further details on the models and their applications, developers can refer to the Phi Cookbook available on GitHub.
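The function-calling pattern mentioned above can be sketched in miniature. This is a hypothetical illustration (the tool names and JSON schema are invented, not Phi's actual output format): the model emits a structured call as text, and the host application parses it and dispatches to a registered tool.

```python
import json

# Hypothetical registry of tools the host application exposes to the model.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "add": lambda a, b: a + b,
}

def dispatch(model_output: str):
    """Parse a JSON function call emitted by the model and run the matching tool."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]          # look up the requested tool
    return fn(**call["arguments"])    # invoke it with the model-supplied args

# A function-calling model would emit a structured call such as:
model_output = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
print(dispatch(model_output))  # 5
```

The tool result would then be fed back to the model as context, letting a small on-device model delegate work (search, calculators, APIs) it cannot do itself.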
[3]
Microsoft releases new Phi models optimized for multimodal processing, efficiency - SiliconANGLE
Microsoft Corp. today expanded its Phi line of open-source language models with two new algorithms optimized for multimodal processing and hardware efficiency, respectively. The first addition is the text-only Phi-4-mini. The second, Phi-4-multimodal, is an upgraded version of Phi-4-mini that can also process visual and audio input. Microsoft says both models significantly outperform comparably sized alternatives at certain tasks.

Phi-4-mini features 3.8 billion parameters, making it compact enough to run on mobile devices. It is based on the ubiquitous transformer neural network architecture that underpins most LLMs. A standard transformer analyzes the text before and after a word to understand its meaning. According to Microsoft, Phi-4-mini uses a variant called a decoder-only transformer, which analyzes only the text that precedes a word when determining its meaning; this lowers hardware usage and speeds up processing.

Phi-4-mini also uses a second performance optimization technique called grouped query attention, or GQA, which reduces the hardware usage of the model's attention mechanism. A language model's attention mechanism helps it determine which data points are most relevant to a given processing task.

Phi-4-mini can generate text, translate existing documents, and take actions in external applications. According to Microsoft, it is particularly adept at math and coding tasks that require "complex reasoning." In a series of internal benchmark tests, the company found that Phi-4-mini completes such tasks with "significantly" better accuracy than several similarly sized language models.

The second new model that Microsoft released today, Phi-4-multimodal, is an upgraded version of Phi-4-mini with 5.6 billion parameters.
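Grouped query attention can be made concrete with a small sketch. The NumPy toy below is an illustration of the general technique, not Phi-4-mini's actual code: several query heads share each key/value head, so the model stores far fewer keys and values, shrinking the KV cache and memory traffic.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of query heads shares one key/value head, so the KV cache
    holds n_kv_heads entries instead of n_q_heads."""
    n_q, seq, d = q.shape
    n_kv = k.shape[0]
    group = n_q // n_kv                       # query heads per shared KV head
    out = np.empty_like(q)
    for h in range(n_q):
        kv = h // group                       # which shared KV head to use
        w = softmax(q[h] @ k[kv].T / np.sqrt(d))
        out[h] = w @ v[kv]
    return out

rng = np.random.default_rng(1)
q = rng.standard_normal((8, 4, 16))   # 8 query heads
k = rng.standard_normal((2, 4, 16))   # only 2 KV heads -> 4x smaller KV cache
v = rng.standard_normal((2, 4, 16))
print(grouped_query_attention(q, k, v).shape)  # (8, 4, 16)
```

With 8 query heads but only 2 key/value heads, the memory devoted to cached keys and values drops by 4x, which matters most during long-context generation on constrained hardware.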
It can process not only text but also images, audio, and video. Microsoft trained the model using a new technique it dubs Mixture of LoRAs.

Adapting an AI model to a new task usually requires changing its weights, the configuration settings that determine how it processes data, which can be costly and time-consuming. As a result, researchers often use an approach known as LoRA (low-rank adaptation): instead of modifying existing weights, LoRA teaches a model an unfamiliar task by adding a small number of new weights optimized for that task. Microsoft's Mixture of LoRAs method applies the same concept to multimodal processing: to create Phi-4-multimodal, the company extended Phi-4-mini with weights optimized for processing audio and visual data. According to Microsoft, the technique mitigates some of the trade-offs associated with other approaches to building multimodal models.

The company tested Phi-4-multimodal's capabilities using more than a half dozen visual data processing benchmarks. The model achieved an average score of 72, trailing OpenAI's GPT-4 by less than one point; Google LLC's Gemini 2.0 Flash, a cutting-edge large language model that debuted in December, scored 74.3. Phi-4-multimodal performed even better in benchmark tests involving both visual and audio input: according to Microsoft, it outperformed Gemini 2.0 Flash "by a large margin" and also bested InternOmni, an open-source LLM that is built specifically to process multimodal data and has a higher parameter count.
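The LoRA idea described above is simple to show in code. The NumPy sketch below is illustrative only, not Microsoft's implementation: the pretrained weight matrix W stays frozen, while a pair of low-rank matrices A and B carries all the task-specific learning. B is initialized to zero, so the adapter starts as a no-op and only changes behavior as it is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def forward(x, scale=1.0):
    """LoRA forward pass: frozen W plus a low-rank update B @ A."""
    return W @ x + scale * (B @ (A @ x))

x = rng.standard_normal(d_in)
# Before any training the adapter contributes nothing...
print(np.allclose(forward(x), W @ x))  # True
# ...and the trainable parameters are tiny compared with W itself.
print(A.size + B.size, "vs", W.size)   # 512 vs 4096
```

Mixture of LoRAs extends this by attaching separate adapter sets for different modalities (e.g., audio and vision) to the same frozen base, which is how interference between modalities is kept low.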
[4]
Microsoft Launches Phi-4 multimodal and Phi-4-mini, Matches OpenAI's GPT-4o
The Phi-4-multimodal model supports applications including document analysis and speech recognition.

Microsoft has launched Phi-4-multimodal and Phi-4-mini, the latest additions to its Phi family of small language models (SLMs). The models are available on Azure AI Foundry, Hugging Face, and the NVIDIA API Catalog.

Phi-4-multimodal is a 5.6 billion-parameter model that integrates speech, vision, and text processing. "By leveraging advanced cross-modal learning techniques, this model enables more natural and context-aware interactions, allowing devices to understand and reason across multiple input modalities simultaneously," said Weizhu Chen, vice president of generative AI at Microsoft.

Last year, Microsoft launched Phi-4, a 14-billion-parameter model that excels at complex reasoning. The new Phi-4-multimodal model supports applications including document analysis and speech recognition. On multimodal audio and visual benchmarks, it surpasses Google's Gemini 2.0 Flash and Gemini 1.5 Pro, and Microsoft claims it is comparable to OpenAI's GPT-4o. The company said the model has demonstrated strong performance in speech-related tasks, surpassing models such as WhisperV3 and SeamlessM4T-v2-Large in automatic speech recognition and speech translation; it also ranks first on the Hugging Face OpenASR leaderboard with a word error rate of 6.14%. The model shows competitive results in document and chart understanding, optical character recognition (OCR), and visual science reasoning.

Phi-4-mini, meanwhile, is a 3.8 billion-parameter text-based model for reasoning, coding, and long-context tasks. It supports sequences of up to 128,000 tokens, offers efficient processing with reduced computational requirements, and supports function calling, allowing integration with external tools and APIs. Both models are suitable for deployment in constrained computing environments.
They can be optimised using ONNX Runtime for cross-platform availability and lower latency. Microsoft is incorporating these models into its ecosystem, including Windows applications and Copilot+ PCs. "Copilot+ PCs will build upon Phi-4-multimodal's capabilities, delivering the power of Microsoft's advanced SLMs without the energy drain," said Vivek Pradeep, vice president and distinguished engineer of Windows Applied Sciences. Developers can access Phi-4-multimodal and Phi-4-mini on multiple platforms and explore their applications in various industries, including finance, healthcare, and automotive technology.
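The word error rate cited for the OpenASR leaderboard is a standard speech-recognition metric: the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal Python implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

# One substitution ("cat" -> "hat") in a four-word reference: WER = 0.25
print(word_error_rate("the cat sat down", "the hat sat down"))  # 0.25
```

A 6.14% WER means roughly one word error per sixteen reference words, averaged over the leaderboard's test sets.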
[5]
Microsoft's new Phi-4 AI models pack big performance in small packages
Microsoft has introduced a new class of highly efficient AI models that process text, images, and speech simultaneously while requiring significantly less computing power than existing systems. The new Phi-4 models, released today, represent a breakthrough in the development of small language models (SLMs) that deliver capabilities previously reserved for much larger AI systems.

Phi-4-multimodal, a model with just 5.6 billion parameters, and Phi-4-mini, with 3.8 billion parameters, outperform similarly sized competitors and even match or exceed the performance of models twice their size on certain tasks, according to Microsoft's technical report.

"These models are designed to empower developers with advanced AI capabilities," said Weizhu Chen, vice president of generative AI at Microsoft. "Phi-4-multimodal, with its ability to process speech, vision, and text simultaneously, opens new possibilities for creating innovative and context-aware applications."

The technical achievement comes at a time when enterprises are increasingly seeking AI models that can run on standard hardware or at the "edge" -- directly on devices rather than in cloud data centers -- to reduce costs and latency while maintaining data privacy.

How Microsoft Built a Small AI Model That Does It All

What sets Phi-4-multimodal apart is its novel "mixture of LoRAs" technique, enabling it to handle text, image, and speech inputs within a single model. "By leveraging the Mixture of LoRAs, Phi-4-Multimodal extends multimodal capabilities while minimizing interference between modalities," the research paper states. "This approach enables seamless integration and ensures consistent performance across tasks involving text, images, and speech/audio."
The innovation allows the model to maintain its strong language capabilities while adding vision and speech recognition, without the performance degradation that often occurs when models are adapted for multiple input types. The model has claimed the top position on the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, outperforming specialized speech recognition systems like WhisperV3. It also demonstrates competitive performance on vision tasks like mathematical and scientific reasoning with images.

Compact AI, massive impact: Phi-4-mini sets new performance standards

Despite its compact size, Phi-4-mini demonstrates exceptional capabilities in text-based tasks. Microsoft reports the model "outperforms similar size models and is on-par with models twice larger" across various language understanding benchmarks. Particularly notable is the model's performance on math and coding tasks. According to the research paper, "Phi-4-Mini consists of 32 Transformer layers with hidden state size of 3,072" and incorporates grouped query attention to optimize memory usage for long-context generation.

On the GSM-8K math benchmark, Phi-4-mini achieved an 88.6% score, outperforming most 8-billion-parameter models, while on the MATH benchmark it reached 64%, substantially higher than similarly sized competitors. "For the Math benchmark, the model outperforms similar sized models with large margins, sometimes more than 20 points. It even outperforms two times larger models' scores," the technical report notes.

Transformative deployments: Phi-4's real-world efficiency in action

Capacity, an AI answer engine that helps organizations unify diverse datasets, has already leveraged the Phi family to enhance its platform's efficiency and accuracy. Steve Frederickson, head of product at Capacity, said in a statement, "From our initial experiments, what truly impressed us about the Phi was its remarkable accuracy and the ease of deployment, even before customization.
Since then, we've been able to enhance both accuracy and reliability, all while maintaining the cost-effectiveness and scalability we valued from the start." Capacity reported a 4.2x cost saving compared with competing workflows while achieving the same or better qualitative results for preprocessing tasks.

AI without limits: Microsoft's Phi-4 models bring advanced intelligence anywhere

For years, AI development has been driven by a singular philosophy: bigger is better. More parameters, larger models, greater computational demands. But Microsoft's Phi-4 models challenge that assumption, proving that power isn't just about scale -- it's about efficiency. Phi-4-multimodal and Phi-4-mini are designed not for the data centers of tech giants, but for the real world -- where computing power is limited, privacy concerns are paramount, and AI needs to work seamlessly without a constant connection to the cloud.

These models are small, but they carry weight. Phi-4-multimodal integrates speech, vision, and text processing into a single system without sacrificing accuracy, while Phi-4-mini delivers math, coding, and reasoning performance on par with models twice its size. This isn't just about making AI more efficient; it's about making it more accessible. Microsoft has positioned Phi-4 for widespread adoption, making it available through Azure AI Foundry, Hugging Face, and the NVIDIA API Catalog. The goal is clear: AI that isn't locked behind expensive hardware or massive infrastructure, but that can operate on standard devices, at the edge of networks, and in industries where compute power is scarce.

Masaya Nishimaki, a director at the Japanese AI firm Headwaters Co., Ltd., sees the impact firsthand. "Edge AI demonstrates outstanding performance even in environments with unstable network connections or where confidentiality is paramount," he said in a statement.
That means AI that can function in factories, hospitals, autonomous vehicles -- places where real-time intelligence is required, but where traditional cloud-based models fall short. At its core, Phi-4 represents a shift in thinking. AI isn't just a tool for those with the biggest servers and the deepest pockets. It's a capability that, if designed well, can work anywhere, for anyone. The most revolutionary thing about Phi-4 isn't what it can do -- it's where it can do it.
Microsoft introduces Phi-4-multimodal and Phi-4-mini, new small language models capable of processing text, speech, and visual data with impressive efficiency and performance.
Microsoft has expanded its Phi line of open-source language models with the introduction of two new algorithms: Phi-4-multimodal and Phi-4-mini. These small language models (SLMs) are designed to process multiple types of data efficiently, challenging the notion that bigger models are always better [1][2][3].
Phi-4-mini is a text-only model with 3.8 billion parameters, optimized for mobile devices and edge computing [2]. Key features include:
- A decoder-only transformer architecture, which analyzes only the text preceding a word, reducing hardware requirements and speeding up processing
- Grouped query attention (GQA) to lower the hardware usage of the attention mechanism
- Support for sequences of up to 128,000 tokens
- Function calling, enabling integration with external tools and APIs
In benchmark tests, Phi-4-mini outperformed similarly sized models and matched the capabilities of some models twice its size [5].
Building on Phi-4-mini, the Phi-4-multimodal model boasts 5.6 billion parameters and can process text, image, audio, and video inputs [2][3]. Notable aspects include:
- Training with Microsoft's Mixture of LoRAs technique, which adds modality-specific weights rather than modifying existing ones
- An average score of 72 across visual data processing benchmarks, just behind OpenAI's GPT-4 (73) and Google's Gemini 2.0 Flash (74.3)
- First place on the Hugging Face OpenASR leaderboard with a 6.14% word error rate, ahead of specialized systems such as WhisperV3
Both models are designed for deployment in constrained computing environments, offering several advantages:
- MIT licensing permitting commercial use, with availability through Azure AI Foundry, Hugging Face, and the NVIDIA API Catalog
- Fine-tuning for specific tasks, such as speech translation or medical question answering, at relatively low computational cost
- Optimization via ONNX Runtime for cross-platform availability and lower latency
Microsoft is incorporating these models into its ecosystem, including Windows applications and Copilot+ PCs [4].
The Phi-4 models have already shown promise in real-world applications:
- Capacity, an AI answer engine, reported a 4.2x cost saving over competing workflows while achieving the same or better results for preprocessing tasks
- Headwaters Co., Ltd. highlights strong edge performance in environments with unstable network connections or strict confidentiality requirements
Microsoft's Phi-4 models represent a significant step towards making advanced AI capabilities more accessible and efficient. By delivering high performance in a compact package, these models open up new possibilities for AI integration across various devices and industries, potentially transforming how AI is deployed and utilized in everyday applications [5].
© 2025 TheOutpost.AI All rights reserved