6 Sources
[1]
Microsoft takes on AI rivals with three new foundational models | TechCrunch
Microsoft AI, the tech giant's research lab, announced the release of three foundational AI models on Thursday that can generate text, voice, and images. The release signals Microsoft's continued push to build out its own stack of multimodal AI models -- and compete with rival AI labs -- even though it remains tied to OpenAI. MAI-Transcribe-1 transcribes speech across 25 different languages into text and is 2.5 times faster than Microsoft's Azure Fast offering, according to a company press release. MAI-Voice-1 is an audio-generating model that can produce 60 seconds of audio in one second and lets users create custom voices. MAI-Image-2 is an image-generating model. MAI-Image-2 was originally released on MAI Playground, a new model-testing platform, on March 19. Now, all three models are being released on Microsoft Foundry, and the transcription and voice models are available in MAI Playground as well. The models were developed by Microsoft's MAI Superintelligence team, an AI research group led by Mustafa Suleyman, the CEO of Microsoft AI, that was formed and announced in November 2025. "At Microsoft AI, we're building Humanist AI. We have a distinct view when creating our AI models -- putting humans at the center, optimizing for how people actually communicate, training for practical use," Suleyman wrote in a blog post. "You'll see more models from us soon in Foundry and directly in Microsoft products and experiences." In an increasingly crowded LLM market, Microsoft hopes one selling point for these models will be that they are cheaper than those from Google and OpenAI, the company wrote in the blog post. MAI-Transcribe-1 starts at $0.36 per hour. MAI-Voice-1 starts at $22 per 1 million characters, and MAI-Image-2 starts at $5 per 1 million tokens for text input and $33 per 1 million tokens for image output.
Despite releasing its own models, Suleyman reaffirmed Microsoft's commitment to its partnership with OpenAI in an interview with VentureBeat -- although a recent renegotiation of that partnership allowed Microsoft to truly pursue this superintelligence research, Suleyman told The Verge. Microsoft has invested more than $13 billion in the AI research lab and hosts its models across its various products through a multi-year partnership. Microsoft takes the same stance with chips: it both produces its own and buys from outside players.
[2]
Microsoft's New AI Models Go Beyond Just Text
Microsoft is doubling down on AI models that aren't large language models. The company announced on Thursday that it is releasing three new models: brand-new models for voice and text transcription, and the second generation of its in-house image model. The voice and text transcription models are the first of their kind from Microsoft. The transcription model can transcribe recordings into text in 25 different languages. It's built for video captioning, meeting transcription, and voice agents. The voice model can create audio recordings up to 60 seconds long. The company says its second-generation image model has a faster generation speed and more lifelike depictions, improving on its previous model. They're available now in Microsoft's Foundry and MAI Playground, with future plans to bring MAI-Image-2 to Bing and PowerPoint. Developers can check out pricing info here. These new models are a clear sign that Microsoft is looking to expand its offerings across the AI market. Microsoft's Copilot is one of the most popular chatbots for businesses, especially those that already use Microsoft's Office 365 suite and Azure cloud service. Aside from the now-outdated original image model, Microsoft has primarily focused on text-based models, trying to distinguish itself among its many competitors as a secure, enterprise-friendly option. Its newest AI tools, Copilot Cowork and Copilot Health, are proof of that. However, these new models are a reminder that Microsoft, as a legacy tech company, has the cash and compute to burn on the kinds of "side quests" that even billion-dollar startups like OpenAI can't always afford. Last week, OpenAI confirmed that it will be discontinuing its Sora AI video app, saying it will refocus on core activities. The AI industry in 2026 has been aiming to prove its tools are useful in the workplace, especially with Anthropic's Claude Code leapfrogging the competition.
Generative media models, like those that power AI image and video generation, require a lot of compute and energy to run, resources that could be spent elsewhere. Google, another legacy tech company with billions of its budget allocated to AI research, indicated this week that it won't be giving up on generative media but will try to make its models more cost- and energy-efficient, as with its new Veo 3.1 Lite video model.
[3]
Microsoft releases new AI models to expand further beyond OpenAI
Microsoft is expanding its roster of in-house AI models, releasing a new speech-to-text system and making two existing models broadly available to developers for the first time. The moves by Microsoft AI (MAI) are part of a broader effort by the company to expand its proprietary AI capabilities beyond its partnership with OpenAI, giving Microsoft more control over its own destiny in the competition against Google, Amazon, and others. Microsoft announced MAI-Transcribe-1 on Thursday, a speech-to-text model that it says is the most accurate currently available. The company also released its existing voice and image generation models, known as MAI-Voice-1 and MAI-Image-2, for broad commercial use. It's Microsoft's first major model release since a March reorganization, announced by CEO Satya Nadella, in which Microsoft AI CEO Mustafa Suleyman shifted away from day-to-day Copilot oversight to focus on frontier model development and superintelligence. Suleyman told The Verge that the transcription model runs at "half the GPU cost of the other state-of-the-art models." He told VentureBeat that the model was built by a team of just 10 people, and that Microsoft plans to eventually build a frontier large language model to be "completely independent" if needed. Microsoft also recently hired former Allen Institute for AI CEO Ali Farhadi and other top AI researchers from the Seattle-based institute to further bolster Suleyman's team, as GeekWire reported last week. MAI-Transcribe-1 is designed to handle noisy real-world conditions such as call centers and conference rooms, and Microsoft says it is testing integrations with Copilot and Teams. Microsoft says it offers the best price-performance of any large cloud provider, competing directly with OpenAI's Whisper and Google's Gemini on the FLEURS benchmark. In a blog post, Suleyman called the model "not just the most accurate but also lightning fast."
MAI-Voice-1 generates natural-sounding speech and now lets developers create custom voices from short snippets of sample audio. MAI-Image-2 ranks in the top three on the Arena.ai image generation leaderboard and is rolling out in Bing and PowerPoint. All three are available on the Microsoft Foundry developer AI platform and MAI Playground.
[4]
Microsoft launches new high-speed voice and image models - SiliconANGLE
Microsoft Corp. today introduced a trio of artificial intelligence models optimized to process images and audio. The algorithms are available through Microsoft Foundry, an Azure service that developers can use to build AI applications. The tech giant has also started rolling out the models to a number of other products. The first new algorithm, MAI-Image-2, can generate images with a resolution of up to 1024 by 1024 pixels based on user instructions. Each prompt may contain up to 32,000 tokens worth of text. Under the hood, MAI-Image-2 turns instructions into images using 10 billion to 50 billion non-embedding parameters. Non-embedding parameters are model components that focus on generating content rather than preliminary data preparation tasks. Microsoft says that MAI-Image-2 is at least twice as fast as its previous-generation image generator. The second new model that debuted today, MAI-Transcribe-1, also brings significant speed improvements. It can transcribe speech 2.5 times faster than Microsoft's earlier models. MAI-Transcribe-1's other selling point is its accuracy. Microsoft tested the model's mean word error rate, a measure of transcript quality, across 25 languages. MAI-Transcribe-1 logged an error rate of 3.9%, which put it ahead of Gemini 3.1 Flash and OpenAI Group PBC's GPT-Transcribe. One contributor to the model's accuracy is that it includes features for filtering environmental noise. On launch, MAI-Transcribe-1 supports batch transcription. That means the model can only process pre-prepared files such as audiobooks. According to Microsoft, a future update will add the ability to transcribe real-time audio streams. The company is also working on a so-called diarization feature that can split the text of a transcript into speaker-specific segments. The third model that Microsoft introduced today is called MAI-Voice-1. As the name suggests, it's optimized to generate synthetic speech based on user-provided scripts. 
Customers can choose one of the built-in AI voices or use their own. Microsoft says all three models offer competitive pricing. MAI-Image-2 is priced at $5 per 1 million input tokens and $33 per 1 million output tokens. MAI-Transcribe-1 costs $0.36 per hour of transcribed speech, while MAI-Voice-1 starts at $22 per 1 million characters. The models are available not only through Microsoft Foundry but also several other services. Microsoft is currently rolling out MAI-Image-2 to Bing and PowerPoint, while MAI-Voice-1 is accessible in an audio creation tool called Copilot Audio Expressions.
[5]
Microsoft launches 3 AI models for transcription, image, and speech generation - The Economic Times
Microsoft on Thursday announced three new models from its Microsoft AI (MAI) model family for transcription, image, and speech generation. These are MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, as Microsoft aims to expand its push into multimodal artificial intelligence (AI) capabilities for developers. The models are available starting today on Microsoft Foundry and the MAI Playground. Formerly Azure AI Studio, Foundry is a unified AI platform to build, customise, and scale generative AI (GenAI) applications and agents. Meanwhile, Playground is its public testing environment where users can experiment with features and provide feedback. "Consistent with our commitment to safe and responsible AI, these MAI models were developed, tested, and rigorously red-teamed. Through Microsoft Foundry, developers get built-in guardrails, governance, and enterprise-grade controls designed to support safe, compliant deployment at scale," wrote Mustafa Suleyman in a blog post. Suleyman leads the AI division at Microsoft. MAI-Transcribe-1 is a speech-to-text model that supports transcription across the 25 most widely used languages, including Hindi. According to Microsoft, the model achieves a lower mean word error rate (WER) than even Google's Gemini 3.1 Flash and OpenAI's GPT-Transcribe. WER evaluates the accuracy of automatic speech recognition (ASR) systems by measuring the percentage of words a model gets wrong. The model offers batch transcription speeds up to 2.5 times faster than Microsoft's existing Azure Fast offering. The starting price of the model is $0.36 per hour. Meanwhile, using MAI-Voice-1, developers will be able to create custom voices with a few seconds of input audio. The model can generate up to 60 seconds of audio in one second, with pricing starting at $22 per one million characters. Finally, MAI-Image-2, Microsoft's latest image generation model, introduced only in the MAI Playground last month, is now broadly accessible via Foundry.
The model delivers at least twice the generation speed compared to earlier versions, based on production data, while maintaining output quality. Pricing starts at $5 per one million text tokens and $33 per one million image tokens. The models are also being integrated into Microsoft products, including Copilot, Bing, and PowerPoint, with enterprise adoption already underway.
[6]
Microsoft Releases AI Models for Transcription, Voice and Image Generation
Microsoft unveiled three new artificial intelligence models offering speech-to-text transcription as well as voice and image generation. The software giant said Thursday it's working to deploy the models to power its consumer and commercial products, and they are now available for its Foundry customers. One of the new models, MAI-Transcribe-1, offers speech-to-text transcription across 25 languages. The model transcribes more than two times faster than Microsoft's existing Azure Fast offering, the company said. The company's MAI-Voice-1 offering, meanwhile, aims to generate natural, realistic speech. Foundry users will also be able to create their own custom voice using a few seconds of audio. Microsoft's image generation model, MAI-Image-2, is already in use across some enterprise partners including marketing and communications firm WPP, Microsoft said. The model allows users to generate images quickly with natural lighting, accurate skin tones and textures, the company said. Microsoft has faced stumbling blocks in the race for dominance in AI. The company's Copilot chatbot, a product central to its AI strategy, hasn't won over users as a clear ChatGPT alternative, and Wall Street has grown concerned that growth in its most important business unit, the Azure cloud-computing business, is slowing. Microsoft continues to double down on its AI efforts, with plans to invest billions globally in AI computing as demand booms.
Microsoft released three foundational AI models designed to generate text, voice, and images, signaling its push to build multimodal AI capabilities independent of OpenAI. The models—MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2—promise faster performance and competitive pricing compared to rivals like Google and OpenAI, while being integrated into Microsoft products including Copilot, Bing, and PowerPoint.
Microsoft AI announced the release of three foundational AI models on Thursday, marking a strategic push to build out its own stack of multimodal AI capabilities and reduce dependence on its longtime partner OpenAI [1].
The trio of AI models—MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2—can generate text from speech, create synthetic audio, and produce images, respectively. All three are now available on Microsoft Foundry and MAI Playground, with plans to integrate them into Microsoft products including Copilot, Bing, and PowerPoint.
The models were developed by Microsoft's MAI Superintelligence team, led by Mustafa Suleyman, CEO of Microsoft AI. Suleyman emphasized the company's distinct approach in a blog post, stating that Microsoft is "building Humanist AI" by "putting humans at the center, optimizing for how people actually communicate, training for practical use" [1]. This release follows a March reorganization in which Suleyman shifted away from day-to-day Copilot oversight to focus on frontier model development and superintelligence [3].
The speech-to-text model MAI-Transcribe-1 represents Microsoft's first foray into transcription technology. It supports 25 widely used languages, including Hindi, and delivers batch transcription speeds up to 2.5 times faster than Microsoft's existing Azure Fast offering [5]. Suleyman told The Verge that the model runs at "half the GPU cost of the other state-of-the-art models" and was built by a team of just 10 people [3]. Microsoft tested the model's mean word error rate across 25 languages, achieving a 3.9% error rate that outperformed both Google's Gemini 3.1 Flash and OpenAI's GPT-Transcribe [4]. The model is designed to handle noisy real-world conditions such as call centers and conference rooms, with features for filtering environmental noise [4]. At launch, MAI-Transcribe-1 supports batch transcription of pre-prepared files, with future updates planned to add real-time audio stream transcription and diarization features that can split transcripts into speaker-specific segments [4].
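The word error rate cited in these benchmarks is a standard metric: the word-level edit distance between a reference transcript and the model's output, divided by the reference length. A minimal sketch of the computation (illustrative only, not Microsoft's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming Levenshtein distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,         # deletion (reference word missed)
                       d[j - 1] + 1,     # insertion (extra hypothesis word)
                       prev + (r != h))  # substitution (or match if equal)
            prev = cur
    return d[-1] / len(ref)

# One substitution ("cat" -> "bat") and one deletion ("the") over 6 words.
print(wer("the cat sat on the mat", "the bat sat on mat"))  # 2/6 ≈ 0.333
```

A 3.9% WER, as reported for MAI-Transcribe-1, means roughly 4 word errors per 100 reference words, averaged across the 25 test languages.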
MAI-Voice-1 enables developers to generate natural-sounding synthetic speech and create custom voices from just a few seconds of sample audio [3]. The voice generation model can produce up to 60 seconds of audio in one second, making it suitable for applications ranging from accessibility tools to content creation [1]. The model is accessible through Copilot Audio Expressions, an audio creation tool [4].

MAI-Image-2, the second generation of Microsoft's in-house image generation model, was originally released on MAI Playground on March 19 before becoming broadly available through Microsoft Foundry [1]. The model can generate images with resolutions up to 1024 by 1024 pixels and processes prompts containing up to 32,000 tokens of text [4]. Using 10 billion to 50 billion non-embedding parameters, MAI-Image-2 delivers at least twice the generation speed of earlier versions while maintaining output quality [4]. The model ranks in the top three on the Arena.ai image generation leaderboard [3].
In an increasingly crowded LLM market, Microsoft positions these high-speed voice and image models as more cost-effective alternatives to offerings from Google and OpenAI [1]. MAI-Transcribe-1 starts at $0.36 per hour, MAI-Voice-1 at $22 per 1 million characters, and MAI-Image-2 at $5 per 1 million text input tokens and $33 per 1 million image output tokens [1]. Microsoft claims this represents the best price-performance of any large cloud provider, competing directly with OpenAI's Whisper and Google's Gemini [3].
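Since the three models bill on different units (audio hours, characters, and tokens), comparing workloads is simple arithmetic on the list prices reported here. The helper below is illustrative only: the function names are made up, and actual Foundry billing may add tiers, minimums, or rounding.

```python
# List prices as reported at launch (assumed flat, which real billing may not be).
TRANSCRIBE_USD_PER_HOUR = 0.36       # MAI-Transcribe-1, per audio hour
VOICE_USD_PER_M_CHARS = 22.0         # MAI-Voice-1, per 1M characters
IMAGE_USD_PER_M_INPUT_TOKENS = 5.0   # MAI-Image-2, text input
IMAGE_USD_PER_M_OUTPUT_TOKENS = 33.0 # MAI-Image-2, image output

def transcription_cost(audio_hours: float) -> float:
    return audio_hours * TRANSCRIBE_USD_PER_HOUR

def voice_cost(characters: int) -> float:
    return characters / 1_000_000 * VOICE_USD_PER_M_CHARS

def image_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000 * IMAGE_USD_PER_M_INPUT_TOKENS
            + output_tokens / 1_000_000 * IMAGE_USD_PER_M_OUTPUT_TOKENS)

# Example: 100 hours of call-center audio transcribed for $36.
print(transcription_cost(100))  # 36.0
```

At these rates, for instance, narrating 2 million characters with MAI-Voice-1 would list at $44, and an image request consuming 1 million input and 1 million output tokens at $38.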
Despite releasing its own multimodal AI capabilities, Suleyman reaffirmed Microsoft's commitment to its partnership with OpenAI in an interview with VentureBeat. However, a recent renegotiation of that partnership allowed Microsoft to truly pursue superintelligence research [1]. Microsoft has invested more than $13 billion in the AI research lab and hosts OpenAI models in its various products through a multi-year partnership [1].
Suleyman told VentureBeat that Microsoft plans to eventually build a frontier large language model to be "completely independent" if needed [3]. To bolster these efforts, Microsoft recently hired former Allen Institute for AI CEO Ali Farhadi and other top AI researchers from the Seattle-based institute [3].
As a legacy tech company, Microsoft has the cash and compute resources to invest in generative media models that even billion-dollar startups like OpenAI struggle to sustain [2]. Last week, OpenAI confirmed it will discontinue its Sora AI video app, citing a need to refocus on core activities [2]. Google, another legacy tech company with significant AI research budgets, indicated this week it will continue investing in generative media while focusing on cost and energy efficiency, as demonstrated by its new Veo 3.1 Lite video model [2].
Consistent with Microsoft's commitment to safe and responsible AI, these models were developed, tested, and rigorously red-teamed. Through Microsoft Foundry, developers receive built-in guardrails, governance, and enterprise-grade controls designed to support safe, compliant deployment at scale [5]. Suleyman wrote that users will "see more models from us soon in Foundry and directly in Microsoft products and experiences" [1], signaling Microsoft's ongoing commitment to expanding its proprietary AI capabilities in the competitive landscape against Google, Amazon, and others.

Summarized by Navi