2 Sources
2 Sources
[1]
Mistral releases a new open-source model for speech generation | TechCrunch
French AI company Mistral released a new open-source text-to-speech model on Thursday that can be used by voice AI assistants or in enterprise use cases like customer support. The model, which lets enterprises build voice agents for sales and customer engagement, puts Mistral in direct competition with the likes of ElevenLabs, Deepgram, and OpenAI. The new model, called Voxtral TTS, supports nine languages, including English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. "Our customers have been asking for a speech model. So we built a small-sized speech model that can fit on a smartwatch, a smartphone, a laptop, or other edge devices. The cost of it is a fraction of anything else on the market, but it offers state-of-the-art performance," Pierre Stock, vp of science operations at Mistral AI, told TechCrunch during a phone interview. Mistral said the new model can adapt a custom voice with a sample of less than five seconds, and also capture characteristics like subtle accents, inflections, intonations, and irregularities in the flow of speech. The model, based on Ministral 3B, can switch between languages easily without losing the characteristics of the voice, which is useful for use cases like dubbing or real-time translation. Stock said the company wanted the model to sound human and not robotic. The model has been built for real-time performance, according to the company. It has a time-to-first-audio (TTFA) -- a measure of when the model starts 'speaking' after receiving input -- of 90ms for a 10-second sample of 500 characters. The model also has a real-time factor (RTF) of 6x, which means it can render a 10-second clip in roughly 1.6 seconds. Earlier this year, Mistral launched a pair of transcription models, one for large batch processing and the other for real-time use cases with low latency. With the new speech model, the company is likely aiming to provide a full suite of voice products to enterprises. "We plan to have an end-to-end platform that can handle multimodal streams of input, including audio, text, and image and output as well. The main benefit of that is you get way more information with an end-to-end agentic system that supports audio as an input or output," Stock said. Mistral's positioning is that its open source and customization bit will help enterprises adopt its voice models over competitors, as they can tune it the way they want.
[2]
Mistral AI just released a text-to-speech model it says beats ElevenLabs -- and it's giving away the weights for free
The enterprise voice AI market is in the middle of a land grab. ElevenLabs and IBM announced a collaboration just this week to bring premium voice capabilities into IBM's watsonx Orchestrate platform. Google Cloud has been expanding its Chirp 3 HD voices. OpenAI continues to iterate on its own speech synthesis. And the market underpinning all of this activity is enormous -- voice AI crossed $22 billion globally in 2026, with the voice AI agents segment alone projected to reach $47.5 billion by 2034, according to industry estimates. On Thursday morning, Mistral AI entered that fight with a fundamentally different proposition. The Paris-based AI startup released Voxtral TTS, what it calls the first frontier-quality, open-weight text-to-speech model designed specifically for enterprise use. Where every major competitor in the space operates a proprietary, API-first business -- enterprises rent the voice, they don't own it -- Mistral is releasing the full model weights, inviting companies to download Voxtral TTS, run it on their own servers or even on a smartphone, and never send a single audio frame to a third party. It is a bet that the future of enterprise voice AI will not be shaped by whoever builds the best-sounding model, but by whoever gives companies the most control over it. And it arrives at a moment when Mistral, valued at $13.8 billion after a $2 billion Series C round led by Dutch chipmaker ASML last September, has been aggressively assembling the building blocks of a complete, enterprise-owned AI stack -- from its Forge customization platform announced at Nvidia GTC earlier this month, to its AI Studio production infrastructure, to the Voxtral Transcribe speech-to-text model released just weeks ago. Voxtral TTS is the output layer that completes that picture, giving enterprises a speech-to-speech pipeline they can run end-to-end without relying on any external provider. "We see audio as a big bet and as a critical and maybe the only future interface with all the AI models," Pierre Stock, Mistral's vice president of science and the first employee hired at the company, said in an exclusive interview with VentureBeat. "This is something customers have been asking for." A 3-billion-parameter model that fits on a laptop and runs six times faster than real-time speech The technical specifications of Voxtral TTS read like a deliberate inversion of industry norms. Where most frontier TTS models are large and resource-intensive, Mistral built its model to be roughly three times smaller than what it calls the industry standard for comparable quality. The architecture comprises three components: a 3.4-billion-parameter transformer decoder backbone, a 390-million-parameter flow-matching acoustic transformer, and a 300-million-parameter neural audio codec that Mistral developed in-house. The system is built on top of Ministral 3B, the same pretrained backbone that powers the company's Voxtral Transcribe model -- a design choice that Stock described as emblematic of Mistral's culture of efficiency and artifact reuse. In practice, the model achieves a time-to-first-audio of 90 milliseconds for a typical input and generates speech at approximately six times real-time speed. When quantized for inference, it requires roughly three gigabytes of RAM. Stock confirmed it can run on any laptop or smartphone, and even on older hardware it still operates in real time. "It's a 3B model, so it can basically run on any laptop or any smartphone," Stock told VentureBeat. "If you quantize it to infer, it's actually three gigabytes of RAM. And you can run it on super old chips -- it's still going to be real time." The model supports nine languages -- English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic -- and can adapt to a custom voice with as little as five seconds of reference audio. Perhaps more remarkably, it demonstrates zero-shot cross-lingual voice adaptation without explicit training for that task. Stock illustrated this with a personal example: he can feed the model 10 seconds of his own French-accented voice, type a prompt in German, and the model will generate German speech that sounds like him -- complete with his natural accent and vocal characteristics. For enterprises operating across borders, this capability unlocks cascaded speech-to-speech translation that preserves speaker identity, a feature that has obvious applications in customer support, sales, and internal communications for multinational organizations. Human evaluators preferred Voxtral over ElevenLabs nearly 70 percent of the time on voice customization Mistral is not being coy about which competitor it intends to displace. In human evaluations conducted by the company, Voxtral TTS achieved a 62.8 percent listener preference rate against ElevenLabs Flash v2.5 on flagship voices and a 69.9 percent preference rate in voice customization tasks. Mistral also claims the model performs at parity with ElevenLabs v3 -- the company's premium, higher-latency tier -- on emotional expressiveness, while maintaining similar latency to the much faster Flash model. The evaluation methodology involved a comparative side-by-side test across all nine supported languages. Using two recognizable voices in their native dialects for each language, three annotators performed preference tests on naturalness, accent adherence, and acoustic similarity to the original reference. Mistral says Voxtral TTS widened the quality gap to ElevenLabs v2.5 Flash especially in zero-shot multilingual custom voice settings, highlighting what the company calls the "instant customizability" of the model. ElevenLabs remains widely regarded as the benchmark for raw voice quality. Its Eleven v3 model has been described by multiple independent reviewers as the gold standard for emotionally nuanced AI speech. But ElevenLabs operates as a closed platform with tiered subscription pricing that scales from around $5 per month at the starter level to over $1,300 per month for business plans. It does not release model weights. Mistral's pitch is that enterprises shouldn't have to choose between quality and control -- and that at scale, the economics of an open-weight model are dramatically more favorable. "What we want to underline is that we're faster and cheaper as well -- and open source," Stock told VentureBeat. "When something is open source and cheap, people adopt it and people build on it." He framed the cost argument in terms that resonate with CTOs managing AI budgets: "AI is a transformative technology, but it has a cost. When you want to scale and have impact on a large business, that cost matters. And what we allow is to scale seamlessly while minimizing the cost and maximizing the accuracy." Why Mistral thinks enterprises will want to own their voice AI rather than rent it To understand why Mistral is entering text-to-speech now, you have to understand the broader strategic architecture the company has been building for the past year. While OpenAI and Anthropic have captured the imagination of consumers, Mistral has quietly assembled what may be the most comprehensive enterprise AI platform in Europe -- and increasingly, globally. CEO Arthur Mensch has said the company is on track to surpass $1 billion in annual recurring revenue this year, according to TechCrunch's reporting on the Forge launch. The Financial Times has reported that Mistral's annualized revenue run rate surged from $20 million to over $400 million within a single year. That growth has been powered by more than 100 major enterprise customers and a consistent thesis: companies should own their AI infrastructure, not rent it. Voxtral TTS is the latest expression of that thesis, applied to what may be the most sensitive category of enterprise data there is. Voice recordings capture not just words but emotion, identity, and intent. They carry legal, regulatory, and reputational weight that text data often does not. For industries like financial services, healthcare, and government -- all key Mistral verticals -- sending voice data to a third-party API introduces risks that many compliance teams are unwilling to accept. Stock made the data sovereignty argument forcefully. "Since the models are open weights, we have no trouble and no problem actually giving the weights to the enterprise and helping them customize the models," he said. "We don't see the weights anymore. We don't see the data. We see nothing. And you are fully controlled." That message has particular resonance in Europe, where concern about technological dependence on American cloud providers has intensified throughout 2026. The EU currently sources more than 80 percent of its digital services from foreign providers, most of them American. Mistral has positioned itself as the answer to that anxiety -- the only European frontier AI developer with the scale and technical capability to offer a credible alternative. Voice agents are the enterprise use case that makes Mistral's full AI stack click into place Voxtral TTS is the final piece in a pipeline Mistral has been methodically assembling. Voxtral Transcribe handles speech-to-text. Mistral's language models -- from Mistral Small to Mistral Large -- provide the reasoning layer. Forge allows enterprises to customize any of these models on their own data. AI Studio provides the production infrastructure for observability, governance, and deployment. And Mistral Compute offers the underlying GPU resources. Together, these pieces form what Stock described as a "full AI stack, fully controllable and customizable" for the enterprise. Voice agents -- AI systems that can listen to a customer, understand what they need, reason about the answer, and respond in natural-sounding speech -- are the use case that ties all of these layers together. The applications Mistral envisions span customer support, where voice agents can route and resolve queries with brand-appropriate speech; sales and marketing, where a single voice can work across markets through cross-lingual emulation; real-time translation for cross-border operations; and even interactive storytelling and game design, where emotion-steering can control tone and personality. Stock was most animated when discussing how Voxtral TTS fits into the broader agentic AI trend that has dominated enterprise technology discussions in 2026. "We are totally building for a world in which audio is a natural interface, in particular for agents to which you can delegate work -- extensions of yourself," he said. He described a scenario in which a user starts planning a vacation on a computer, commutes to work, and then picks up the workflow on a phone simply by asking for an update by voice. "To make that happen, you need a model you can trust, you need a model that's super efficient and super cheap to run -- otherwise you won't use it for long -- and you need a model that sounds super conversational and that you can interrupt at any time," Stock said. That emphasis on interruptibility and real-time responsiveness reflects a broader insight about voice interfaces that distinguishes them from text. A chatbot can take two or three seconds to respond without breaking the user experience. A voice agent cannot. The 90-millisecond time-to-first-audio that Voxtral TTS achieves is not just a benchmark number -- it is the threshold between a voice interaction that feels natural and one that feels robotic. Mistral's open-weight approach aligns with a broader industry shift that even Nvidia is backing Mistral's decision to release Voxtral TTS with open weights is consistent with a movement that has been gathering momentum across the AI industry. At Nvidia GTC earlier this month, Nvidia CEO Jensen Huang declared that "proprietary versus open is not a thing -- it's proprietary and open." Nvidia announced the Nemotron Coalition, a first-of-its-kind collaboration of model builders working to advance open frontier-level foundation models, with Mistral as a founding member. The first project from that coalition will be a base model codeveloped by Mistral AI and Nvidia. For Mistral, open weights serve a dual commercial purpose. They drive adoption -- developers and enterprises can experiment without friction or commitment -- while the company monetizes through its platform services, customization offerings, and managed infrastructure. The model is available to test in Mistral Studio and through the company's API, but the strategic play is to become embedded in enterprise voice pipelines as an owned asset, not a metered service. This mirrors the playbook that worked for Mistral's language models. As Mensch told CNBC in February, "AI is making us able to develop software at the speed of light," predicting that "more than half of what's currently being bought by IT in terms of SaaS is going to shift to AI." He described a "replatforming" taking place across enterprise technology, with businesses looking to replace legacy software systems with AI-native alternatives. An open-weight voice model that enterprises can customize and deploy on their own terms fits naturally into that narrative. Mistral signals that end-to-end audio AI is where the company is headed next When asked what comes after Voxtral TTS, Stock outlined two directions. The first is expanding language and dialect support, with particular attention to cultural nuance. "It's not the same to speak French in Paris than to speak French in Canada, in Montreal," he said. "We want to respect both cultures, and we want our models to perform in both contexts with all the cultural specifics." The second direction is more ambitious: a fully end-to-end audio model that doesn't just generate speech from text but understands the complete spectrum of human vocal communication. "We convey some meaning with the words we speak," Stock said. "We actually convey way more with the intonation, the rhythm, and how we say it. When people talk about end-to-end audio, that's what they mean -- the model is able to pick up that you're in a hurry, for instance, and will go for the fastest answer. The model will know that you're joyful today and crack a joke. It's super adaptive to you, and that's where we want to go." That vision -- an AI that speaks naturally, listens with nuance, responds with emotional intelligence, and runs on a model small enough to fit in your pocket -- is the frontier every major AI lab is racing toward. For now, Voxtral TTS gives Mistral a foundation to build on and enterprises a question they haven't had to answer before: if you could own your voice AI stack outright, at lower cost and with competitive quality, why would you keep renting someone else's?
Share
Share
Copy Link
French AI startup Mistral launched Voxtral TTS, an open-weight text-to-speech model that runs on edge devices from smartwatches to laptops. The 3-billion-parameter model supports nine languages, adapts custom voices in under five seconds, and outperformed ElevenLabs in human evaluation tests. Unlike proprietary competitors, Mistral offers full model weights for enterprises to own and deploy locally.
French AI company Mistral released Voxtral TTS on Thursday, marking its entry into the enterprise voice AI market with a fundamentally different approach than established players
1
. While competitors like ElevenLabs, OpenAI, and Deepgram operate proprietary, API-first businesses where enterprises rent voice capabilities, Mistral is releasing full model weights for its open-source text-to-speech model, allowing companies to download, customize, and run it on their own infrastructure without sending audio data to third parties2
. This positions Mistral AI text-to-speech as a data sovereignty play in a market that crossed $22 billion globally in 2026, with voice agents alone projected to reach $47.5 billion by 20342
.
Source: TechCrunch
The technical specifications of Voxtral TTS demonstrate Mistral's focus on efficiency and accessibility. Built on a 3.4-billion-parameter transformer decoder backbone with a 390-million-parameter flow-matching acoustic transformer and a 300-million-parameter neural audio codec, the text-to-speech model is roughly three times smaller than industry standards for comparable quality
2
. Pierre Stock, vice president of science operations at Mistral AI and the company's first employee, told TechCrunch that the model can fit on a smartwatch, smartphone, or laptop, with costs representing "a fraction of anything else on the market" while offering state-of-the-art performance1
. When quantized for inference, it requires roughly three gigabytes of RAM and can run on older hardware while maintaining real-time performance2
.Voxtral TTS supports nine languages including English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic, with the ability to adapt custom voices using less than five seconds of reference audio
1
. The model captures subtle accents, inflections, intonations, and irregularities in speech flow, aiming for human-like voice generation rather than robotic output1
. Perhaps most notably, it demonstrates zero-shot cross-lingual voice adaptation without explicit training for that task2
. Stock illustrated this capability by explaining he can provide 10 seconds of his French-accented voice, type a prompt in German, and the model will generate German speech that sounds like him, complete with his natural vocal characteristics2
. This feature unlocks applications in dubbing, real-time translation, customer support, and sales for multinational organizations.
Source: VentureBeat
Related Stories
The model achieves a time-to-first-audio of 90 milliseconds for a 10-second sample of 500 characters and operates at a real-time factor of 6x, meaning it can render a 10-second clip in roughly 1.6 seconds
1
. This latency performance positions it for conversational voice agents in customer engagement scenarios. In human evaluations conducted by Mistral, Voxtral TTS achieved a 62.8 percent listener preference rate against ElevenLabs Flash v2.5 on flagship voices and a 69.9 percent preference rate in voice customization tasks2
. The company also claims performance at parity with ElevenLabs v3, directly challenging the ElevenLabs competitor narrative in the enterprise voice AI space2
.Voxtral TTS represents the latest component in Mistral's strategy to provide a complete AI stack for enterprises. Earlier this year, the company launched transcription models for both batch processing and real-time use cases with low latency
1
. Combined with its Forge customization platform announced at Nvidia GTC and AI Studio production infrastructure, the open-weight model completes a speech-to-speech pipeline that enterprises can run end-to-end without relying on external providers2
. Stock told VentureBeat that "we see audio as a big bet and as a critical and maybe the only future interface with all the AI models," adding that the company plans to develop an end-to-end platform handling multimodal streams of input including audio, text, and image1
2
. Mistral's positioning centers on the belief that open-source customization will drive enterprise adoption over competitors, as companies can tune the model to their specific requirements while maintaining control over model weights and avoiding API dependencies. Valued at $13.8 billion after a $2 billion Series C round led by Dutch chipmaker ASML last September, Mistral is betting that the future of enterprise voice AI will be determined not by who builds the best-sounding model, but by who gives companies the most control over it2
.Summarized by
Navi
16 Jul 2025•Technology

04 Feb 2026•Technology

08 May 2025•Technology

1
Technology

2
Entertainment and Society

3
Policy and Regulation
