5 Sources
[1]
Mistral releases a new open-source model for speech generation | TechCrunch
French AI company Mistral released a new open-source text-to-speech model on Thursday that can be used by voice AI assistants or in enterprise use cases like customer support. The model, which lets enterprises build voice agents for sales and customer engagement, puts Mistral in direct competition with the likes of ElevenLabs, Deepgram, and OpenAI. The new model, called Voxtral TTS, supports nine languages, including English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. "Our customers have been asking for a speech model. So we built a small-sized speech model that can fit on a smartwatch, a smartphone, a laptop, or other edge devices. The cost of it is a fraction of anything else on the market, but it offers state-of-the-art performance," Pierre Stock, vp of science operations at Mistral AI, told TechCrunch during a phone interview. Mistral said the new model can adapt a custom voice with a sample of less than five seconds, and also capture characteristics like subtle accents, inflections, intonations, and irregularities in the flow of speech. The model, based on Ministral 3B, can switch between languages easily without losing the characteristics of the voice, which is useful for use cases like dubbing or real-time translation. Stock said the company wanted the model to sound human and not robotic. The model has been built for real-time performance, according to the company. It has a time-to-first-audio (TTFA) -- a measure of when the model starts 'speaking' after receiving input -- of 90ms for a 10-second sample of 500 characters. The model also has a real-time factor (RTF) of 6x, which means it can render a 10-second clip in roughly 1.6 seconds. Earlier this year, Mistral launched a pair of transcription models, one for large batch processing and the other for real-time use cases with low latency. With the new speech model, the company is likely aiming to provide a full suite of voice products to enterprises. 
"We plan to have an end-to-end platform that can handle multimodal streams of input, including audio, text, and image and output as well. The main benefit of that is you get way more information with an end-to-end agentic system that supports audio as an input or output," Stock said. Mistral's positioning is that its open source and customization bit will help enterprises adopt its voice models over competitors, as they can tune it the way they want.
[2]
Mistral AI just released a text-to-speech model it says beats ElevenLabs -- and it's giving away the weights for free
The enterprise voice AI market is in the middle of a land grab. ElevenLabs and IBM announced a collaboration just this week to bring premium voice capabilities into IBM's watsonx Orchestrate platform. Google Cloud has been expanding its Chirp 3 HD voices. OpenAI continues to iterate on its own speech synthesis. And the market underpinning all of this activity is enormous -- voice AI crossed $22 billion globally in 2026, with the voice AI agents segment alone projected to reach $47.5 billion by 2034, according to industry estimates. On Thursday morning, Mistral AI entered that fight with a fundamentally different proposition. The Paris-based AI startup released Voxtral TTS, what it calls the first frontier-quality, open-weight text-to-speech model designed specifically for enterprise use. Where every major competitor in the space operates a proprietary, API-first business -- enterprises rent the voice, they don't own it -- Mistral is releasing the full model weights, inviting companies to download Voxtral TTS, run it on their own servers or even on a smartphone, and never send a single audio frame to a third party. It is a bet that the future of enterprise voice AI will not be shaped by whoever builds the best-sounding model, but by whoever gives companies the most control over it. And it arrives at a moment when Mistral, valued at $13.8 billion after a $2 billion Series C round led by Dutch chipmaker ASML last September, has been aggressively assembling the building blocks of a complete, enterprise-owned AI stack -- from its Forge customization platform announced at Nvidia GTC earlier this month, to its AI Studio production infrastructure, to the Voxtral Transcribe speech-to-text model released just weeks ago. Voxtral TTS is the output layer that completes that picture, giving enterprises a speech-to-speech pipeline they can run end-to-end without relying on any external provider.
"We see audio as a big bet and as a critical and maybe the only future interface with all the AI models," Pierre Stock, Mistral's vice president of science and the first employee hired at the company, said in an exclusive interview with VentureBeat. "This is something customers have been asking for."
A 3-billion-parameter model that fits on a laptop and runs six times faster than real-time speech
The technical specifications of Voxtral TTS read like a deliberate inversion of industry norms. Where most frontier TTS models are large and resource-intensive, Mistral built its model to be roughly three times smaller than what it calls the industry standard for comparable quality. The architecture comprises three components: a 3.4-billion-parameter transformer decoder backbone, a 390-million-parameter flow-matching acoustic transformer, and a 300-million-parameter neural audio codec that Mistral developed in-house. The system is built on top of Ministral 3B, the same pretrained backbone that powers the company's Voxtral Transcribe model -- a design choice that Stock described as emblematic of Mistral's culture of efficiency and artifact reuse. In practice, the model achieves a time-to-first-audio of 90 milliseconds for a typical input and generates speech at approximately six times real-time speed. When quantized for inference, it requires roughly three gigabytes of RAM. Stock confirmed it can run on any laptop or smartphone, and even on older hardware it still operates in real time. "It's a 3B model, so it can basically run on any laptop or any smartphone," Stock told VentureBeat. "If you quantize it to infer, it's actually three gigabytes of RAM. And you can run it on super old chips -- it's still going to be real time." The model supports nine languages -- English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic -- and can adapt to a custom voice with as little as five seconds of reference audio.
Perhaps more remarkably, it demonstrates zero-shot cross-lingual voice adaptation without explicit training for that task. Stock illustrated this with a personal example: he can feed the model 10 seconds of his own French-accented voice, type a prompt in German, and the model will generate German speech that sounds like him -- complete with his natural accent and vocal characteristics. For enterprises operating across borders, this capability unlocks cascaded speech-to-speech translation that preserves speaker identity, a feature that has obvious applications in customer support, sales, and internal communications for multinational organizations.
Human evaluators preferred Voxtral over ElevenLabs nearly 70 percent of the time on voice customization
Mistral is not being coy about which competitor it intends to displace. In human evaluations conducted by the company, Voxtral TTS achieved a 62.8 percent listener preference rate against ElevenLabs Flash v2.5 on flagship voices and a 69.9 percent preference rate in voice customization tasks. Mistral also claims the model performs at parity with ElevenLabs v3 -- the company's premium, higher-latency tier -- on emotional expressiveness, while maintaining similar latency to the much faster Flash model. The evaluation methodology involved a comparative side-by-side test across all nine supported languages. Using two recognizable voices in their native dialects for each language, three annotators performed preference tests on naturalness, accent adherence, and acoustic similarity to the original reference. Mistral says Voxtral TTS widened the quality gap to ElevenLabs v2.5 Flash especially in zero-shot multilingual custom voice settings, highlighting what the company calls the "instant customizability" of the model. ElevenLabs remains widely regarded as the benchmark for raw voice quality. Its Eleven v3 model has been described by multiple independent reviewers as the gold standard for emotionally nuanced AI speech.
But ElevenLabs operates as a closed platform with tiered subscription pricing that scales from around $5 per month at the starter level to over $1,300 per month for business plans. It does not release model weights. Mistral's pitch is that enterprises shouldn't have to choose between quality and control -- and that at scale, the economics of an open-weight model are dramatically more favorable. "What we want to underline is that we're faster and cheaper as well -- and open source," Stock told VentureBeat. "When something is open source and cheap, people adopt it and people build on it." He framed the cost argument in terms that resonate with CTOs managing AI budgets: "AI is a transformative technology, but it has a cost. When you want to scale and have impact on a large business, that cost matters. And what we allow is to scale seamlessly while minimizing the cost and maximizing the accuracy."
Why Mistral thinks enterprises will want to own their voice AI rather than rent it
To understand why Mistral is entering text-to-speech now, you have to understand the broader strategic architecture the company has been building for the past year. While OpenAI and Anthropic have captured the imagination of consumers, Mistral has quietly assembled what may be the most comprehensive enterprise AI platform in Europe -- and increasingly, globally. CEO Arthur Mensch has said the company is on track to surpass $1 billion in annual recurring revenue this year, according to TechCrunch's reporting on the Forge launch. The Financial Times has reported that Mistral's annualized revenue run rate surged from $20 million to over $400 million within a single year. That growth has been powered by more than 100 major enterprise customers and a consistent thesis: companies should own their AI infrastructure, not rent it. Voxtral TTS is the latest expression of that thesis, applied to what may be the most sensitive category of enterprise data there is.
Voice recordings capture not just words but emotion, identity, and intent. They carry legal, regulatory, and reputational weight that text data often does not. For industries like financial services, healthcare, and government -- all key Mistral verticals -- sending voice data to a third-party API introduces risks that many compliance teams are unwilling to accept. Stock made the data sovereignty argument forcefully. "Since the models are open weights, we have no trouble and no problem actually giving the weights to the enterprise and helping them customize the models," he said. "We don't see the weights anymore. We don't see the data. We see nothing. And you are fully controlled." That message has particular resonance in Europe, where concern about technological dependence on American cloud providers has intensified throughout 2026. The EU currently sources more than 80 percent of its digital services from foreign providers, most of them American. Mistral has positioned itself as the answer to that anxiety -- the only European frontier AI developer with the scale and technical capability to offer a credible alternative.
Voice agents are the enterprise use case that makes Mistral's full AI stack click into place
Voxtral TTS is the final piece in a pipeline Mistral has been methodically assembling. Voxtral Transcribe handles speech-to-text. Mistral's language models -- from Mistral Small to Mistral Large -- provide the reasoning layer. Forge allows enterprises to customize any of these models on their own data. AI Studio provides the production infrastructure for observability, governance, and deployment. And Mistral Compute offers the underlying GPU resources. Together, these pieces form what Stock described as a "full AI stack, fully controllable and customizable" for the enterprise.
Voice agents -- AI systems that can listen to a customer, understand what they need, reason about the answer, and respond in natural-sounding speech -- are the use case that ties all of these layers together. The applications Mistral envisions span customer support, where voice agents can route and resolve queries with brand-appropriate speech; sales and marketing, where a single voice can work across markets through cross-lingual emulation; real-time translation for cross-border operations; and even interactive storytelling and game design, where emotion-steering can control tone and personality. Stock was most animated when discussing how Voxtral TTS fits into the broader agentic AI trend that has dominated enterprise technology discussions in 2026. "We are totally building for a world in which audio is a natural interface, in particular for agents to which you can delegate work -- extensions of yourself," he said. He described a scenario in which a user starts planning a vacation on a computer, commutes to work, and then picks up the workflow on a phone simply by asking for an update by voice. "To make that happen, you need a model you can trust, you need a model that's super efficient and super cheap to run -- otherwise you won't use it for long -- and you need a model that sounds super conversational and that you can interrupt at any time," Stock said. That emphasis on interruptibility and real-time responsiveness reflects a broader insight about voice interfaces that distinguishes them from text. A chatbot can take two or three seconds to respond without breaking the user experience. A voice agent cannot. The 90-millisecond time-to-first-audio that Voxtral TTS achieves is not just a benchmark number -- it is the threshold between a voice interaction that feels natural and one that feels robotic. 
Mistral's open-weight approach aligns with a broader industry shift that even Nvidia is backing
Mistral's decision to release Voxtral TTS with open weights is consistent with a movement that has been gathering momentum across the AI industry. At Nvidia GTC earlier this month, Nvidia CEO Jensen Huang declared that "proprietary versus open is not a thing -- it's proprietary and open." Nvidia announced the Nemotron Coalition, a first-of-its-kind collaboration of model builders working to advance open frontier-level foundation models, with Mistral as a founding member. The first project from that coalition will be a base model codeveloped by Mistral AI and Nvidia. For Mistral, open weights serve a dual commercial purpose. They drive adoption -- developers and enterprises can experiment without friction or commitment -- while the company monetizes through its platform services, customization offerings, and managed infrastructure. The model is available to test in Mistral Studio and through the company's API, but the strategic play is to become embedded in enterprise voice pipelines as an owned asset, not a metered service. This mirrors the playbook that worked for Mistral's language models. As Mensch told CNBC in February, "AI is making us able to develop software at the speed of light," predicting that "more than half of what's currently being bought by IT in terms of SaaS is going to shift to AI." He described a "replatforming" taking place across enterprise technology, with businesses looking to replace legacy software systems with AI-native alternatives. An open-weight voice model that enterprises can customize and deploy on their own terms fits naturally into that narrative.
Mistral signals that end-to-end audio AI is where the company is headed next
When asked what comes after Voxtral TTS, Stock outlined two directions. The first is expanding language and dialect support, with particular attention to cultural nuance.
"It's not the same to speak French in Paris than to speak French in Canada, in Montreal," he said. "We want to respect both cultures, and we want our models to perform in both contexts with all the cultural specifics." The second direction is more ambitious: a fully end-to-end audio model that doesn't just generate speech from text but understands the complete spectrum of human vocal communication. "We convey some meaning with the words we speak," Stock said. "We actually convey way more with the intonation, the rhythm, and how we say it. When people talk about end-to-end audio, that's what they mean -- the model is able to pick up that you're in a hurry, for instance, and will go for the fastest answer. The model will know that you're joyful today and crack a joke. It's super adaptive to you, and that's where we want to go." That vision -- an AI that speaks naturally, listens with nuance, responds with emotional intelligence, and runs on a model small enough to fit in your pocket -- is the frontier every major AI lab is racing toward. For now, Voxtral TTS gives Mistral a foundation to build on and enterprises a question they haven't had to answer before: if you could own your voice AI stack outright, at lower cost and with competitive quality, why would you keep renting someone else's?
[3]
Mistral releases an open-weights 'speaking' AI model with Voxtral TTS - SiliconANGLE
The Paris-based Mistral AI SAS today announced the release of Voxtral TTS, its first text-to-speech artificial intelligence model aimed at unseating the best-known and most powerful voice models on the market. The new model is very lightweight, with four billion parameters, which makes it a size that can be run on most consumer hardware, including modern laptops, mid-range desktop graphics processing units and even some high-end mobile devices (at high compression). The company is releasing it with open-weights, which means that it's an open-source model. Mistral said that the highlights of the model make it highly adaptable for new voices and it has a very low delay before the first audio, producing a quick response time. Although the model is small, it still creates powerful voices. The company said it not only recites but interprets text accurately, a must for any text-to-speech generation. It is capable of producing emotionality and tonality fitting to oration, for example neutral, happy, sarcastic and so on. The objective is to capture how a person would naturally speak. Even in English, the voice capability includes American, English and French dialects. Competition against proprietary large language speech models is intense, so Mistral compared it to ElevenLabs Inc., the incumbent to beat. For voice agents, the company said human evaluations show Voxtral TTS shows naturalness compared to ElevenLabs Flash v2.5 and also performs at parity to the larger v3 model in more lifelike interactions. Although the English market is quite large, Mistral is a French company; as a result, Voxtral TTS is a multilingual model. The company said it was trained on a large speech dataset and was built for global applications. It supports state-of-the-art performance in nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi and Arabic.
The model can be trained to adapt and voice-clone with a reference of as little as three seconds. It can capture not just the voice but nuances like subtle accent, inflections, intonations and even casual vocal fillers such as "ums," "ahs," other interruptions, pauses and repetitions natural to the speaker's rhythm and cadence. This level of fidelity, in addition to the small size and open weights, means that Mistral is betting that enterprise companies will want to own their own voice models and run them on their own systems locally. It also provides the foundation for more powerful text-to-speech AI models that provide even more texture, customization and power in the future that Mistral can provide for enterprise environments. Users can get started with the model today in Mistral Studio or Le Chat. The open model is available for developers with several reference voices and can be downloaded from Hugging Face under a Creative Commons license.
[4]
Mistral unveils open source TTS model for voice agents
French AI company Mistral has launched Voxtral TTS, an open source text-to-speech model designed for voice AI assistants and enterprise applications such as customer support. The model targets businesses looking to build voice agents for sales and engagement while placing Mistral in competition with companies like ElevenLabs and OpenAI. Voxtral TTS supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The model aims to meet customer demands for a speech model, according to Pierre Stock, VP of science operations at Mistral AI. Stock stated, "We built a small-sized speech model that can fit on a smartwatch, a smartphone, a laptop, or other edge devices. The cost of it is a fraction of anything else on the market, but it offers state-of-the-art performance." The model can adapt to a custom voice with a sample of less than five seconds, capturing accents, inflections, intonations, and speech irregularities. Voxtral TTS is based on Ministral 3B and can switch languages without losing voice characteristics, which benefits use cases like dubbing and real-time translation. Voxtral TTS is built for real-time functionality, with a time-to-first-audio (TTFA) of 90 milliseconds for a 10-second sample of 500 characters. The model features a real-time factor (RTF) of 6x, allowing it to render a 10-second clip in approximately 1.6 seconds. Earlier this year, Mistral launched two transcription models aimed at large batch processing and low-latency real-time applications. The introduction of Voxtral TTS aligns with Mistral's goal to create a complete suite of voice products for enterprise use. Stock added, "We plan to have an end-to-end platform that can handle multimodal streams of input, including audio, text, and image." This aims to provide richer information through a system capable of supporting various data types.
Mistral emphasizes that its open-source and customization capabilities are intended to encourage enterprises to adopt its models over competitors. Companies will have the ability to tailor the technology to their needs, offering potential advantages in customer engagement.
[5]
Mistral's Open-Source Voice Model Sparks New AI Assistant Rivalry
Mistral Launches Open-Source Voice Model That Could Lower Costs and Intensify Competition in the Fast-Growing AI Assistant Space
Mistral AI has made a move that has surprised the AI world. The French startup has released a new open-source voice model. This new tool is designed to power AI assistants and other voice-based tools. The idea is simple: make voice technology cheaper, faster, and easier for everyone. This sudden launch comes at a time when voice is gradually becoming a key part of everyday technology. From customer support bots to smart devices, businesses want assistants that can talk naturally and respond quickly. With this new model, Mistral is trying to win over developers and companies who want flexibility without incurring huge costs.
French AI company Mistral AI launched Voxtral TTS, an open-source text-to-speech model that runs on edge devices from smartwatches to laptops. The 3-billion-parameter model supports nine languages, adapts to a custom voice from less than five seconds of reference audio, and achieves a 90 ms time-to-first-audio. In human evaluations conducted by Mistral, listeners preferred it over ElevenLabs Flash v2.5 in 69.9% of voice customization comparisons, positioning Mistral to compete directly in the enterprise voice AI market.
Mistral AI released Voxtral TTS on Thursday, marking the French AI company's entry into the enterprise voice AI market with a fundamentally different strategy than its competitors [1]. While ElevenLabs, OpenAI, and others operate proprietary, API-first businesses where enterprises rent voice capabilities, Mistral is releasing full model weights, allowing companies to download Voxtral TTS and run it on their own servers or even smartphones without sending audio data to third parties [2]. This open-source voice model positions Mistral AI directly against established players in a market that crossed $22 billion globally in 2026, with voice agents alone projected to reach $47.5 billion by 2034 [2].
The text-to-speech model supports nine languages including English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic, enabling voice AI assistants and enterprise applications like customer support and sales engagement [1][3]. Pierre Stock, VP of science operations at Mistral AI, explained that customers had been requesting a speech model, leading the company to build a solution that fits on edge devices at a fraction of market costs while delivering frontier-quality text-to-speech performance [1].
Voxtral TTS comprises a 3.4-billion-parameter transformer decoder backbone, a 390-million-parameter flow-matching acoustic transformer, and a 300-million-parameter neural audio codec developed in-house [2]. Built on Ministral 3B, the same pretrained backbone powering Mistral's transcription models, this architecture enables the model to run on laptops, smartphones, smartwatches, and other edge devices [1][3].
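As a rough size check, the sketch below sums the three component sizes reported in source [2] and estimates weight storage at a few bit widths. The bit widths and the weights-only memory model are assumptions for illustration; the sources do not state how the model is quantized or which components the "roughly three gigabytes" figure in source [2] covers.

```python
# Component sizes in billions of parameters, as reported in source [2].
COMPONENTS_B = {
    "transformer decoder backbone": 3.4,
    "flow-matching acoustic transformer": 0.39,
    "neural audio codec": 0.30,
}


def total_params_billion(components=COMPONENTS_B):
    """Sum the component sizes (in billions of parameters)."""
    return sum(components.values())


def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: counts weights only; activations, caches, and runtime
    overhead are ignored.
    """
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    total = total_params_billion()
    print(f"total parameters: {total:.2f}B")
    # Hypothetical bit widths; the sources do not say which one Mistral uses.
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: ~{weight_memory_gb(total, bits):.2f} GB")
```

Under these assumptions the components total about 4.09 billion parameters, which is consistent with the "four billion" figure in source [3], while the "3B" used elsewhere appears to refer to the Ministral 3B backbone. The three-gigabyte figure from source [2] falls between the 4-bit and 8-bit weights-only estimates.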
When quantized for inference, the model requires roughly three gigabytes of RAM and can operate in real time even on older hardware [2]. The system achieves a time-to-first-audio of 90 milliseconds for a 10-second sample of 500 characters and generates speech at approximately six times real-time speed, rendering a 10-second clip in roughly 1.6 seconds [1][2]. This real-time performance makes it suitable for low-latency applications where immediate response matters.
The model demonstrates remarkable custom voice adaptation, requiring less than five seconds of reference audio to clone a voice while capturing subtle accents, inflections, intonations, and irregularities in speech flow [1]. One report puts the minimum at as little as three seconds of audio, and says the model captures not just the voice but nuances like vocal fillers such as "ums" and "ahs," pauses, and repetitions natural to a speaker's rhythm [3].
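The timing figures quoted earlier in this section reduce to simple arithmetic, sketched below. The function names are hypothetical, and the streaming estimate assumes generation stays ahead of playback for the whole clip, which the sources imply but do not measure. Note that the sources use "6x" to mean six times faster than real time; the real-time factor is more conventionally defined as processing time divided by audio duration, where lower is better.

```python
TTFA_S = 0.090     # time-to-first-audio reported in sources [1] and [2]
RTF_SPEEDUP = 6.0  # "6x" as used by the sources: six times faster than real time


def render_time_s(audio_s, speedup=RTF_SPEEDUP):
    """Seconds of compute needed to synthesize `audio_s` seconds of speech."""
    return audio_s / speedup


def stream_finish_time_s(audio_s, ttfa_s=TTFA_S):
    """Wall-clock time at which playback ends when audio is streamed.

    Assumption: playback starts at TTFA and never stalls, which holds
    only if generation stays faster than real time throughout.
    """
    return ttfa_s + audio_s


if __name__ == "__main__":
    clip_s = 10.0
    print(f"render time for {clip_s:.0f} s of audio: {render_time_s(clip_s):.2f} s")
    print(f"streamed playback finishes at: {stream_finish_time_s(clip_s):.2f} s")
```

Ten seconds of audio at a sixfold speedup works out to about 1.67 seconds of compute, which the sources round to "roughly 1.6 seconds". For a voice agent the more relevant number is the 90 ms before the first audio is heard, since the rest of the clip can be generated while earlier audio plays.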
Stock illustrated the model's zero-shot cross-lingual voice adaptation capability with a personal example: he can provide 10 seconds of his French-accented voice, type a prompt in German, and the model generates German speech that sounds like him, complete with his natural accent [2]. The model switches between languages easily without losing voice characteristics, making it valuable for dubbing and real-time translation use cases [1]. The company designed the model to sound natural rather than robotic, with emotion and tone suited to the delivery, such as neutral, happy, or sarcastic [3].
In human evaluations conducted by Mistral AI, Voxtral TTS achieved a 62.8% listener preference rate against ElevenLabs Flash v2.5 on flagship voices and a 69.9% preference rate in voice customization tasks [2]. The company claims the model performs at parity with ElevenLabs v3, positioning it competitively against the incumbent leader in enterprise voice AI [2][3].
This performance comes from a model roughly three times smaller than what Mistral describes as the industry standard for comparable quality [2]. The combination of small size, open-weights availability, and high fidelity means Mistral is betting that enterprise companies will prefer to own their voice models and run them locally on their own systems rather than relying on external providers [3].
Voxtral TTS completes a broader vision for Mistral AI, valued at $13.8 billion after a $2 billion Series C round led by Dutch chipmaker ASML last September [2]. Earlier this year, the company launched transcription models for large batch processing and real-time use cases with low latency [1][4]. Combined with its Forge customization platform announced at Nvidia GTC and AI Studio production infrastructure, Voxtral TTS provides the output layer for a speech-to-speech pipeline that enterprises can run end-to-end [2].
Stock stated that the company plans to have an end-to-end multimodal platform that can handle streams of input including audio, text, and image, with the main benefit being richer information through an agentic system supporting various data types [1][4]. "We see audio as a big bet and as a critical and maybe the only future interface with all the AI models," Stock told VentureBeat [2].
Mistral's positioning centers on how its open-source approach and customization capabilities will help enterprises adopt its voice models over competitors, allowing companies to tune the technology to their specific needs for voice agents, customer engagement, and enterprise applications [1][4]. Users can access the model through Mistral Studio or Le Chat, with open-weights text-to-speech available for developers to download from Hugging Face under a Creative Commons license [3]. This release intensifies competition in the fast-growing AI assistant space, potentially lowering costs and expanding access to advanced voice technology [5].
Summarized by Navi