6 Sources
[1]
Microsoft takes on AI rivals with three new foundational models | TechCrunch
Microsoft AI, the tech giant's research lab, announced the release of three foundational AI models on Thursday that can generate text, voice, and images. The release signals Microsoft's continued push to build out its own stack of multimodal AI models -- and compete with rival AI labs -- even though it remains tied to OpenAI. MAI-Transcribe-1 transcribes speech across 25 different languages into text and is 2.5 times faster than Microsoft's Azure Fast offering, according to a company press release. MAI-Voice-1 is an audio-generating model that can produce 60 seconds of audio in one second and lets users create custom voices. MAI-Image-2 is an image-generating model. MAI-Image-2 was originally released on MAI Playground, a new model-testing platform, on March 19. Now, all three models are being released on Microsoft Foundry, and the transcription and voice models are available in MAI Playground as well. The models were developed by Microsoft's MAI Superintelligence team, an AI research group led by Mustafa Suleyman, the CEO of Microsoft AI, that was formed and announced in November 2025. "At Microsoft AI, we're building Humanist AI. We have a distinct view when creating our AI models -- putting humans at the center, optimizing for how people actually communicate, training for practical use," Suleyman wrote in a blog post. "You'll see more models from us soon in Foundry and directly in Microsoft products and experiences." In an increasingly crowded LLM market, Microsoft hopes one selling point for these models will be that they are cheaper than those from Google and OpenAI, the company wrote in the blog post. MAI-Transcribe-1 starts at $0.36 per hour. MAI-Voice-1 starts at $22 per 1 million characters, and MAI-Image-2 starts at $5 per 1 million tokens for text input and $33 per 1 million tokens for image output.
Despite releasing its own models, Suleyman reaffirmed Microsoft's commitment to its partnership with OpenAI in an interview with VentureBeat -- although a recent renegotiation of that partnership allowed Microsoft to truly pursue this superintelligence research, Suleyman told The Verge. Microsoft has invested more than $13 billion in the AI research lab and hosts its models across its various products through a multi-year partnership. Microsoft takes the same stance with chips: it both produces its own and buys from outside players.
[2]
Microsoft's New AI Models Go Beyond Just Text
Microsoft is doubling down on AI models that aren't large language models. The company announced on Thursday that it is releasing three new models: brand-new models for voice and text transcription, and the second generation of its in-house image model. The voice and text transcription models are the first of their kind from Microsoft. The transcription model can transcribe recordings into text in 25 different languages. It's built for video captioning, meeting transcription, and voice agents. The voice model can create audio recordings up to 60 seconds long. The company says its second-generation image model has a faster generation speed and more lifelike depictions, improving on its previous model. They're available now in Microsoft's Foundry and MAI Playground, with future plans to bring MAI-Image-2 to Bing and PowerPoint. Developers can check out pricing info here. These new models are a clear sign that Microsoft is looking to expand its offerings across the AI market. Microsoft's Copilot is one of the most popular chatbots for businesses, especially those that already use Microsoft's Office 365 suite and Azure cloud service. Aside from the now-outdated original image model, Microsoft has primarily focused on text-based models, trying to distinguish itself among its many competitors as a secure, enterprise-friendly option. Its newest AI tools, Copilot Cowork and Copilot Health, are proof of that. However, these new models are a reminder that Microsoft, as a legacy tech company, has the cash and compute to burn on the kinds of "side quests" that even billion-dollar startups like OpenAI can't always afford. Last week, OpenAI confirmed that it will be discontinuing its Sora AI video app, saying it will refocus on core activities. The AI industry in 2026 has been aiming to prove its tools are useful in the workplace, especially with Anthropic's Claude Code leapfrogging the competition.
Generative media models, like those that power AI image and video generation, require a lot of compute and energy to run, resources that could be spent elsewhere. Google, another legacy tech company with billions of its budget allocated to AI research, indicated this week that it won't be giving up on generative media but will try to make its models more cost- and energy-efficient, as with its new Veo 3.1 Lite video model.
[3]
Microsoft releases new AI models to expand further beyond OpenAI
Microsoft is expanding its roster of in-house AI models, releasing a new speech-to-text system and making two existing models broadly available to developers for the first time. The moves by Microsoft AI (MAI) are part of a broader effort by the company to expand its proprietary AI capabilities beyond its partnership with OpenAI, giving Microsoft more control over its own destiny in the competition against Google, Amazon, and others. Microsoft announced MAI-Transcribe-1 on Thursday, a speech-to-text model that it says is the most accurate currently available. The company also released its existing voice and image generation models, known as MAI-Voice-1 and MAI-Image-2, for broad commercial use. It's Microsoft's first major model release since a March reorganization, announced by CEO Satya Nadella, in which Microsoft AI CEO Mustafa Suleyman shifted away from day-to-day Copilot oversight to focus on frontier model development and superintelligence. Suleyman told The Verge that the transcription model runs at "half the GPU cost of the other state-of-the-art models." He told VentureBeat that the model was built by a team of just 10 people, and that Microsoft plans to eventually build a frontier large language model to be "completely independent" if needed. Microsoft also recently hired former Allen Institute for AI CEO Ali Farhadi and other top AI researchers from the Seattle-based institute to further bolster Suleyman's team, as GeekWire reported last week. MAI-Transcribe-1 is designed to handle noisy real-world conditions such as call centers and conference rooms, and Microsoft says it is testing integrations with Copilot and Teams. Microsoft says it offers the best price-performance of any large cloud provider, competing directly with OpenAI's Whisper and Google's Gemini on the FLEURS benchmark. In a blog post, Suleyman called the model "not just the most accurate but also lightning fast."
MAI-Voice-1 generates natural-sounding speech and now lets developers create custom voices from short snippets of sample audio. MAI-Image-2 ranks in the top three on the Arena.ai image generation leaderboard and is rolling out in Bing and PowerPoint. All three are available on the Microsoft Foundry developer AI platform and MAI Playground.
[4]
Microsoft launches new high-speed voice and image models - SiliconANGLE
Microsoft Corp. today introduced a trio of artificial intelligence models optimized to process images and audio. The algorithms are available through Microsoft Foundry, an Azure service that developers can use to build AI applications. The tech giant has also started rolling out the models to a number of other products. The first new algorithm, MAI-Image-2, can generate images with a resolution of up to 1024 by 1024 pixels based on user instructions. Each prompt may contain up to 32,000 tokens worth of text. Under the hood, MAI-Image-2 turns instructions into images using 10 billion to 50 billion non-embedding parameters. Non-embedding parameters are model components that focus on generating content rather than preliminary data preparation tasks. Microsoft says that MAI-Image-2 is at least twice as fast as its previous-generation image generator. The second new model that debuted today, MAI-Transcribe-1, also brings significant speed improvements. It can transcribe speech 2.5 times faster than Microsoft's earlier models. MAI-Transcribe-1's other selling point is its accuracy. Microsoft tested the model's mean word error rate, a measure of transcript quality, across 25 languages. MAI-Transcribe-1 logged an error rate of 3.9%, which put it ahead of Gemini 3.1 Flash and OpenAI Group PBC's GPT-Transcribe. One contributor to the model's accuracy is that it includes features for filtering environmental noise. On launch, MAI-Transcribe-1 supports batch transcription. That means the model can only process pre-prepared files such as audiobooks. According to Microsoft, a future update will add the ability to transcribe real-time audio streams. The company is also working on a so-called diarization feature that can split the text of a transcript into speaker-specific segments. The third model that Microsoft introduced today is called MAI-Voice-1. As the name suggests, it's optimized to generate synthetic speech based on user-provided scripts. 
Customers can choose one of the built-in AI voices or use their own. Microsoft says all three models offer competitive pricing. MAI-Image-2 is priced at $5 per 1 million input tokens and $33 per 1 million output tokens. MAI-Transcribe-1 costs $0.36 per hour of transcribed speech, while MAI-Voice-1 starts at $22 per 1 million characters. The models are available not only through Microsoft Foundry but also several other services. Microsoft is currently rolling out MAI-Image-2 to Bing and PowerPoint, while MAI-Voice-1 is accessible in an audio creation tool called Copilot Audio Expressions.
[5]
Microsoft launches 3 AI models for transcription, image, and speech generation - The Economic Times
Microsoft on Thursday announced three new models from its Microsoft AI (MAI) model family for transcription, image, and speech generation. These are MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, as Microsoft aims to expand its push into multimodal artificial intelligence (AI) capabilities for developers. The models are available starting today on Microsoft Foundry and the MAI Playground. Formerly Azure AI Studio, Foundry is a unified AI platform to build, customise, and scale generative AI (GenAI) applications and agents. Meanwhile, Playground is its public testing environment where users can experiment with features and provide feedback. "Consistent with our commitment to safe and responsible AI, these MAI models were developed, tested, and rigorously red-teamed. Through Microsoft Foundry, developers get built-in guardrails, governance, and enterprise-grade controls designed to support safe, compliant deployment at scale," wrote Mustafa Suleyman in a blog post. Suleyman leads the AI division at Microsoft. MAI-Transcribe-1 is a speech-to-text model that supports transcription across the 25 most widely used languages, including Hindi. According to Microsoft, the model achieves a lower mean word error rate (WER) than even Google's Gemini 3.1 Flash and OpenAI's GPT-Transcribe. WER evaluates the accuracy of automatic speech recognition (ASR) systems by measuring the percentage of words a model gets wrong. The model offers batch transcription speeds up to 2.5 times faster than Microsoft's existing Azure Fast offering. The starting price of the model is $0.36 per hour. Meanwhile, using MAI-Voice-1, developers will be able to create custom voices with a few seconds of input audio. The model can generate up to 60 seconds of audio in one second, with pricing starting at $22 per one million characters. Finally, MAI-Image-2, Microsoft's latest image generation model, introduced only in the MAI Playground last month, is now broadly accessible via Foundry.
The model delivers at least twice the generation speed compared to earlier versions, based on production data, while maintaining output quality. Pricing starts at $5 per one million text tokens and $33 per one million image tokens. The models are also being integrated into Microsoft products, including Copilot, Bing, and PowerPoint, with enterprise adoption already underway.
[6]
Microsoft Releases AI Models for Transcription, Voice and Image Generation
Microsoft unveiled three new artificial intelligence models offering speech-to-text transcription as well as voice and image generation. The software giant said Thursday it's working to deploy the models to power its consumer and commercial products, and they are now available for its Foundry customers. One of the new models, MAI-Transcribe-1, offers speech-to-text transcription across 25 languages. The model transcribes more than two times faster than Microsoft's existing Azure Fast offering, the company said. The company's MAI-Voice-1 offering, meanwhile, aims to generate natural, realistic speech. Foundry users will also be able to create their own custom voice using a few seconds of audio. Microsoft's image generation model, MAI-Image-2, is already in use across some enterprise partners including marketing and communications firm WPP, Microsoft said. The model allows users to generate images quickly with natural lighting, accurate skin tones and textures, the company said. Microsoft has faced stumbling blocks in the race for dominance in AI. The company's Copilot chatbot, a product central to its AI strategy, hasn't won over users as a clear ChatGPT alternative, and Wall Street has grown concerned that growth in its most important business unit, the Azure cloud-computing business, is slowing. Microsoft continues to double down on its AI efforts, with plans to invest billions globally in AI computing as demand booms.
Microsoft released three foundational AI models designed to generate text, voice, and images, signaling its push to build multimodal AI capabilities independent of OpenAI. The models—MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2—promise faster performance and competitive pricing compared to rivals like Google and OpenAI, while being integrated into Microsoft products including Copilot, Bing, and PowerPoint.
Microsoft AI announced the release of three foundational AI models on Thursday, marking a strategic push to build out its own stack of multimodal AI capabilities and reduce dependence on its longtime partner OpenAI [1].
The trio of AI models—MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2—can generate text from speech, create synthetic audio, and produce images, respectively. All three are now available on Microsoft Foundry and MAI Playground, with plans to integrate them into Microsoft products including Copilot, Bing, and PowerPoint.
The models were developed by Microsoft's MAI Superintelligence team, led by Mustafa Suleyman, CEO of Microsoft AI. Suleyman emphasized the company's distinct approach in a blog post, stating that Microsoft is "building Humanist AI" by "putting humans at the center, optimizing for how people actually communicate, training for practical use" [1]. This release follows a March reorganization in which Suleyman shifted away from day-to-day Copilot oversight to focus on frontier model development and superintelligence [3].
The speech-to-text model MAI-Transcribe-1 represents Microsoft's first foray into transcription technology. It supports 25 widely used languages, including Hindi, and delivers batch transcription speeds up to 2.5 times faster than Microsoft's existing Azure Fast offering [5]. Suleyman told The Verge that the model runs at "half the GPU cost of the other state-of-the-art models" and was built by a team of just 10 people [3]. Microsoft tested the model's mean word error rate across 25 languages, achieving a 3.9% error rate that outperformed both Google's Gemini 3.1 Flash and OpenAI's GPT-Transcribe [4]. The model is designed to handle noisy real-world conditions such as call centers and conference rooms, with features for filtering environmental noise [4]. At launch, MAI-Transcribe-1 supports batch transcription of pre-prepared files, with future updates planned to add real-time audio stream transcription and diarization features that can split transcripts into speaker-specific segments [4].
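The word error rate cited in these benchmarks is a standard metric: the word-level edit distance between a reference transcript and the model's output, divided by the reference length. A minimal sketch of the computation (illustrative only, not Microsoft's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,         # deletion (reference word missed)
                       d[j - 1] + 1,     # insertion (extra hypothesis word)
                       prev + (r != h))  # substitution (or match if equal)
            prev = cur
    return d[-1] / len(ref)

# One substitution ("cat" -> "bat") and one deletion ("the") over 6 words.
print(wer("the cat sat on the mat", "the bat sat on mat"))  # 2/6 ≈ 0.333
```

A 3.9% WER, as reported for MAI-Transcribe-1, means roughly 4 word errors per 100 reference words, averaged across the 25 test languages.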
MAI-Voice-1 enables developers to generate natural-sounding synthetic speech and create custom voices from just a few seconds of sample audio [3]. The voice generation model can produce up to 60 seconds of audio in one second, making it suitable for applications ranging from accessibility tools to content creation [1]. The model is accessible through Copilot Audio Expressions, an audio creation tool [4].

MAI-Image-2, the second generation of Microsoft's in-house image generation model, was originally released on MAI Playground on March 19 before becoming broadly available through Microsoft Foundry [1]. The model can generate images with resolutions up to 1024 by 1024 pixels and processes prompts containing up to 32,000 tokens of text [4]. Using 10 billion to 50 billion non-embedding parameters, MAI-Image-2 delivers at least twice the generation speed of earlier versions while maintaining output quality [4]. The model ranks in the top three on the Arena.ai image generation leaderboard [3].
In an increasingly crowded LLM market, Microsoft positions these high-speed voice and image models as more cost-effective alternatives to offerings from Google and OpenAI [1]. MAI-Transcribe-1 starts at $0.36 per hour, MAI-Voice-1 at $22 per 1 million characters, and MAI-Image-2 at $5 per 1 million text input tokens and $33 per 1 million image output tokens [1]. Microsoft claims this represents the best price-performance of any large cloud provider, competing directly with OpenAI's Whisper and Google's Gemini [3].
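Since the three models bill on different units (audio hours, characters, and tokens), comparing workloads is simple arithmetic on the list prices reported here. The helper below is illustrative only: the function names are made up, and actual Foundry billing may add tiers, minimums, or rounding.

```python
# List prices as reported at launch (assumed flat, which real billing may not be).
TRANSCRIBE_USD_PER_HOUR = 0.36       # MAI-Transcribe-1, per audio hour
VOICE_USD_PER_M_CHARS = 22.0         # MAI-Voice-1, per 1M characters
IMAGE_USD_PER_M_INPUT_TOKENS = 5.0   # MAI-Image-2, text input
IMAGE_USD_PER_M_OUTPUT_TOKENS = 33.0 # MAI-Image-2, image output

def transcription_cost(audio_hours: float) -> float:
    return audio_hours * TRANSCRIBE_USD_PER_HOUR

def voice_cost(characters: int) -> float:
    return characters / 1_000_000 * VOICE_USD_PER_M_CHARS

def image_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000 * IMAGE_USD_PER_M_INPUT_TOKENS
            + output_tokens / 1_000_000 * IMAGE_USD_PER_M_OUTPUT_TOKENS)

# Example: 100 hours of call-center audio transcribed for $36.
print(transcription_cost(100))  # 36.0
```

At these rates, for instance, narrating 2 million characters with MAI-Voice-1 would list at $44, and an image request consuming 1 million input and 1 million output tokens at $38.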
Despite releasing its own multimodal AI capabilities, Suleyman reaffirmed Microsoft's commitment to its partnership with OpenAI in an interview with VentureBeat. However, a recent renegotiation of that partnership allowed Microsoft to truly pursue superintelligence research [1]. Microsoft has invested more than $13 billion in the AI research lab and hosts OpenAI models in its various products through a multi-year partnership [1].
Suleyman told VentureBeat that Microsoft plans to eventually build a frontier large language model to be "completely independent" if needed [3]. To bolster these efforts, Microsoft recently hired former Allen Institute for AI CEO Ali Farhadi and other top AI researchers from the Seattle-based institute [3].
As a legacy tech company, Microsoft has the cash and compute resources to invest in generative media models that even billion-dollar startups like OpenAI struggle to sustain [2]. Last week, OpenAI confirmed it will discontinue its Sora AI video app, citing a need to refocus on core activities [2]. Google, another legacy tech company with significant AI research budgets, indicated this week it will continue investing in generative media while focusing on cost and energy efficiency, as demonstrated by its new Veo 3.1 Lite video model [2].
Consistent with Microsoft's commitment to safe and responsible AI, these models were developed, tested, and rigorously red-teamed. Through Microsoft Foundry, developers receive built-in guardrails, governance, and enterprise-grade controls designed to support safe, compliant deployment at scale [5]. Suleyman wrote that users will "see more models from us soon in Foundry and directly in Microsoft products and experiences" [1], signaling Microsoft's ongoing commitment to expanding its proprietary AI capabilities in the competitive landscape against Google, Amazon, and others.

Summarized by Navi