Mistral AI releases Voxtral TTS, an open-source voice model challenging ElevenLabs and OpenAI

Reviewed by Nidhi Govil

French AI company Mistral AI launched Voxtral TTS, an open-source text-to-speech model that runs on edge devices from smartwatches to laptops. Built on the Ministral 3B backbone, the model supports nine languages, adapts to a new voice from under five seconds of reference audio, and achieves a 90 ms time-to-first-audio. In human evaluations, listeners preferred it over ElevenLabs 69.9% of the time in voice customization tasks, positioning Mistral to compete directly in the enterprise voice AI market.

Mistral AI Enters Enterprise Voice AI Market With Open-Weights Approach

Mistral AI released Voxtral TTS on Thursday, marking the French AI company's entry into the enterprise voice AI market with a fundamentally different strategy than its competitors [1]. While ElevenLabs, OpenAI, and others operate proprietary, API-first businesses where enterprises rent voice capabilities, Mistral is releasing full model weights, allowing companies to download Voxtral TTS and run it on their own servers or even smartphones without sending audio data to third parties [2]. This open-source voice model positions Mistral AI directly against established players in a market that crossed $22 billion globally in 2026, with voice agents alone projected to reach $47.5 billion by 2034 [2].

Source: Analytics Insight

The text-to-speech model supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. This enables voice AI assistants and enterprise applications such as customer support and sales engagement [1][3]. Pierre Stock, VP of science operations at Mistral AI, explained that customers had been requesting a speech model, leading the company to build a solution that fits on edge devices at a fraction of market costs while delivering frontier-quality text-to-speech performance [1].

Compact Architecture Delivers Real-Time Performance on Consumer Hardware

Voxtral TTS comprises a 3.4-billion-parameter transformer decoder backbone, a 390-million-parameter flow-matching acoustic transformer, and a 300-million-parameter neural audio codec developed in-house [2]. Built on Ministral 3B, the same pretrained backbone powering Mistral's transcription models, this architecture enables the model to run on laptops, smartphones, smartwatches, and other edge devices [1][3].
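As a quick sanity check on the component sizes quoted above, the following Python sketch sums the three reported parts and shows each one's share of the total. The figures are those reported in the article and are rounded, so the total is approximate; the dictionary keys and function names are illustrative only and do not come from Mistral.

```python
# Component sizes as reported in the article, in billions of parameters.
# The reported figures are rounded, so the total is only approximate.
COMPONENTS = {
    "transformer decoder backbone": 3.4,
    "flow-matching acoustic transformer": 0.39,
    "neural audio codec": 0.30,
}


def total_params_billions(components):
    """Sum the component sizes (billions of parameters)."""
    return sum(components.values())


def shares(components):
    """Return each component's fraction of the total parameter count."""
    total = total_params_billions(components)
    return {name: size / total for name, size in components.items()}


total = total_params_billions(COMPONENTS)
component_shares = shares(COMPONENTS)

print(f"total: about {total:.2f}B parameters")
for name, share in component_shares.items():
    print(f"{name}: {share:.1%}")
```

On these numbers the backbone accounts for most of the roughly 4.1 billion parameters, with the acoustic transformer and codec together contributing well under a fifth.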

Source: TechCrunch

When quantized for inference, the model requires roughly three gigabytes of RAM and can operate in real time even on older hardware [2]. The system achieves a time-to-first-audio of 90 milliseconds for a 10-second sample of 500 characters and generates speech at approximately six times real-time speed, rendering a 10-second clip in roughly 1.6 seconds [1][2]. This real-time performance makes it suitable for low-latency applications where immediate response matters.
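The speed and memory figures above can be checked with simple arithmetic. The Python sketch below computes the real-time factor from the reported clip and render times, and a rough upper bound on bits per weight implied by the three-gigabyte figure. It assumes 1 GB = 10^9 bytes, treats all of that RAM as model weights, and takes the total parameter count as the sum of the three reported component sizes (3.4B + 0.39B + 0.3B); the function names are illustrative only.

```python
def real_time_factor(audio_seconds, render_seconds):
    """Seconds of audio produced per second of compute."""
    return audio_seconds / render_seconds


def implied_bits_per_weight(ram_gigabytes, params_billions):
    """Upper bound on average bits per parameter.

    Assumes 1 GB = 1e9 bytes and that all RAM holds weights; in practice
    some memory goes to activations and caches, so the true figure is lower.
    """
    return ram_gigabytes * 8 / params_billions


# Figures reported in the article.
rtf = real_time_factor(audio_seconds=10.0, render_seconds=1.6)
bits = implied_bits_per_weight(ram_gigabytes=3.0, params_billions=3.4 + 0.39 + 0.30)

print(f"real-time factor: {rtf:.2f}x")
print(f"implied precision: at most about {bits:.1f} bits per weight")
```

A factor of about 6.25 matches the "approximately six times real-time" claim, and the memory figure is consistent with weights stored at well under 8 bits each.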

Custom Voice Adaptation and Human-Like Voice Generation Capabilities

The model demonstrates remarkable custom voice adaptation, requiring less than five seconds of reference audio to clone a voice while capturing subtle accents, inflections, intonations, and irregularities in speech flow [1]. Voxtral TTS can adapt with as little as three seconds of audio, capturing not just voice but nuances like vocal fillers such as "ums" and "ahs," pauses, and repetitions natural to a speaker's rhythm [3].

Source: VentureBeat

Stock illustrated the model's zero-shot cross-lingual voice adaptation capability with a personal example: he can provide 10 seconds of his French-accented voice, type a prompt in German, and the model generates German speech that sounds like him, complete with his natural accent [2]. The model switches between languages easily without losing voice characteristics, making it valuable for dubbing and real-time translation use cases [1]. The company designed the model to generate human-like speech that sounds natural rather than robotic, with emotion and tone suited to the delivery, including neutral, happy, and sarcastic tones [3].

Human Evaluations Show Strong Performance Against ElevenLabs

In human evaluations conducted by Mistral AI, Voxtral TTS achieved a 62.8% listener preference rate against ElevenLabs Flash v2.5 on flagship voices and a 69.9% preference rate in voice customization tasks [2]. The company claims the model performs at parity with ElevenLabs v3, positioning it competitively against the incumbent leader in enterprise voice AI [2][3].

This performance comes from a model roughly three times smaller than what Mistral describes as the industry standard for comparable quality [2]. With this combination of small size, open-weights availability, and high fidelity, Mistral is betting that enterprises will prefer to own their voice models and run them locally on their own systems rather than relying on external providers [3].

Building Toward Complete Enterprise-Owned AI Stack

Voxtral TTS completes a broader vision for Mistral AI, valued at $13.8 billion after a $2 billion Series C round led by Dutch chipmaker ASML last September [2]. Earlier this year, the company launched transcription models for large batch processing and real-time use cases with low latency [1][4]. Combined with its Forge customization platform announced at Nvidia GTC and AI Studio production infrastructure, Voxtral TTS provides the output layer for a speech-to-speech pipeline that enterprises can run end-to-end [2].

Stock stated that the company plans to build an end-to-end multimodal platform that can handle streams of input including audio, text, and image, with the main benefit being richer information through an agentic system that supports various data types [1][4]. "We see audio as a big bet and as a critical and maybe the only future interface with all the AI models," Stock told VentureBeat [2].

Mistral is betting that its open-source approach and customization capabilities will lead enterprises to adopt its voice models over competitors', since companies can tune the technology to their specific needs for voice agents, customer engagement, and enterprise applications [1][4]. Users can access the model through Mistral Studio or Le Chat, and developers can download the open-weights text-to-speech model from Hugging Face under a Creative Commons license [3]. This release intensifies competition in the fast-growing AI assistant space, potentially lowering costs and expanding access to advanced voice technology [5].

TheOutpost.ai

© 2026 Triveous Technologies Private Limited