Microsoft unveils three foundational AI models to compete with OpenAI and Google

Reviewed by Nidhi Govil

6 Sources

Microsoft released three foundational AI models designed to generate text, voice, and images, signaling its push to build multimodal AI capabilities independent of OpenAI. The models—MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2—promise faster performance and competitive pricing compared to rivals like Google and OpenAI, while being integrated into Microsoft products including Copilot, Bing, and PowerPoint.

Microsoft Expands Multimodal AI Capabilities With Three New Models

Microsoft AI announced the release of three foundational AI models on Thursday, marking a strategic push to build out its own stack of multimodal AI capabilities and reduce dependence on its longtime partner OpenAI [1].

Source: SiliconANGLE

The trio of AI models (MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2) can generate text from speech, create synthetic audio, and produce images, respectively. All three are now available on Microsoft Foundry and MAI Playground, with plans to integrate them into Microsoft products including Copilot, Bing, and PowerPoint.

The models were developed by Microsoft's MAI Superintelligence team, led by Mustafa Suleyman, CEO of Microsoft AI. Suleyman emphasized the company's distinct approach in a blog post, stating that Microsoft is "building Humanist AI" by "putting humans at the center, optimizing for how people actually communicate, training for practical use" [1]. This release follows a March reorganization in which Suleyman shifted away from day-to-day Copilot oversight to focus on frontier model development and superintelligence [3].

Source: GeekWire

High-Speed Performance Sets MAI-Transcribe-1 Apart

The speech-to-text model MAI-Transcribe-1 represents Microsoft's first foray into transcription technology. It supports 25 widely used languages, including Hindi, and delivers batch transcription speeds up to 2.5 times faster than Microsoft's existing Azure Fast offering [5]. Suleyman told The Verge that the model runs at "half the GPU cost of the other state-of-the-art models" and was built by a team of just 10 people [3].

Microsoft measured the model's mean word error rate across 25 languages and reported a 3.9% error rate, which it says outperformed both Google's Gemini 3.1 Flash and OpenAI's GPT-Transcribe [4]. The model is designed to handle noisy real-world conditions such as call centers and conference rooms, with features for filtering environmental noise [4](https://siliconangle.com/2026/04/02/microsoft-launches-new-high-speed-voice-image-models/). At launch, MAI-Transcribe-1 supports batch transcription of prerecorded files, with future updates planned to add real-time audio stream transcription and diarization features that can split transcripts into speaker-specific segments [4].
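For readers unfamiliar with the metric, word error rate is conventionally computed as the word-level edit distance between a reference transcript and the model's output (substitutions, deletions, and insertions), divided by the number of words in the reference. The sketch below illustrates that standard definition only. It is not Microsoft's evaluation code, and the function names, the sample sentences, and the unweighted averaging across languages are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer(per_language_wer: dict[str, float]) -> float:
    """Unweighted mean over languages (an assumed aggregation method)."""
    return sum(per_language_wer.values()) / len(per_language_wer)


if __name__ == "__main__":
    # Hypothetical sample: one substituted word out of four reference words.
    print(word_error_rate("please join the call", "please join a call"))
```

Under this definition, a 3.9% mean error rate corresponds to roughly four word-level mistakes per 100 reference words, averaged over the languages tested.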

Voice Generation and Image Creation Round Out the Lineup

MAI-Voice-1 enables developers to generate natural-sounding synthetic speech and create custom voices from just a few seconds of sample audio [3]. The voice generation model can produce up to 60 seconds of audio in one second, making it suitable for applications ranging from accessibility tools to content creation [1]. The model is accessible through Copilot Audio Expressions, an audio creation tool [4].

MAI-Image-2, the second generation of Microsoft's in-house AI model for image generation, was originally released on MAI Playground on March 19 before becoming broadly available through Microsoft Foundry [1]. The model can generate images at resolutions up to 1024 by 1024 pixels and accepts prompts of up to 32,000 tokens of text [4]. With between 10 billion and 50 billion non-embedding parameters, MAI-Image-2 delivers at least twice the generation speed of earlier versions while maintaining output quality [4]. The model ranks in the top three on the Arena.ai image generation leaderboard [3].

Source: CNET

Competitive Pricing Strategy Targets the LLM Market

In an increasingly crowded LLM market, Microsoft positions these high-speed voice and image models as more cost-effective alternatives to offerings from Google and OpenAI [1]. MAI-Transcribe-1 starts at $0.36 per hour of audio, MAI-Voice-1 at $22 per 1 million characters, and MAI-Image-2 at $5 per 1 million text input tokens and $33 per 1 million image output tokens [1]. Microsoft claims this represents the best price-performance of any large cloud provider, competing directly with OpenAI's Whisper and Google's Gemini [3].
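To put the quoted rates in concrete terms, the following minimal sketch turns them into per-workload estimates. The prices are the starting figures reported above. The function names and the sample workload sizes are hypothetical, and actual bills may differ because the article gives only "starts at" pricing.

```python
# Starting prices in USD as quoted in the article; real pricing tiers may differ.
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0


def transcribe_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a number of synthesized characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from text input and image output tokens."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


if __name__ == "__main__":
    # Hypothetical monthly workload, chosen only to illustrate the arithmetic.
    print(f"100 hours of transcription: ${transcribe_cost(100):.2f}")
    print(f"500,000 characters of speech: ${voice_cost(500_000):.2f}")
    print(f"1,000 input + 4,000 output image tokens: ${image_cost(1_000, 4_000):.4f}")
```

At these rates, transcribing 100 hours of audio would cost about $36 and synthesizing 500,000 characters of speech about $11.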

Strategic Independence While Maintaining OpenAI Partnership

Despite releasing its own multimodal models, Suleyman reaffirmed Microsoft's commitment to its partnership with OpenAI in an interview with VentureBeat. However, a recent renegotiation of that partnership allowed Microsoft to pursue superintelligence research [1]. Microsoft has invested more than $13 billion in OpenAI and hosts OpenAI models in its various products through a multi-year partnership [1].

Suleyman told VentureBeat that Microsoft plans to eventually build a frontier large language model to be "completely independent" if needed [3]. To bolster these efforts, Microsoft recently hired former Allen Institute for AI CEO Ali Farhadi and other top AI researchers from the Seattle-based institute [3].

Legacy Tech Advantage in Generative Media Models

As a legacy tech company, Microsoft has the cash and compute resources to invest in generative media models that even billion-dollar startups like OpenAI struggle to sustain [2]. Last week, OpenAI confirmed it will discontinue its Sora AI video app, citing a need to refocus on core activities [2]. Google, another legacy tech company with significant AI research budgets, indicated this week it will continue investing in generative media while focusing on cost and energy efficiency, as demonstrated by its new Veo 3.1 Lite video model [2].

Consistent with Microsoft's commitment to safe and responsible AI, these models were developed, tested, and rigorously red-teamed. Through Microsoft Foundry, developers receive built-in guardrails, governance, and enterprise-grade controls designed to support safe, compliant deployment at scale [5]. Suleyman wrote that users will "see more models from us soon in Foundry and directly in Microsoft products and experiences" [1], signaling Microsoft's ongoing commitment to expanding its proprietary AI capabilities in the competitive landscape against Google, Amazon, and others.

TheOutpost.ai

© 2026 Triveous Technologies Private Limited