3 Sources
[1]
A New Mistral AI Model's Ultra-Fast Translation Gives Big AI Labs a Run for Their Money
Mistral AI has released a new family of AI models that it claims will clear the path to seamless conversation between people speaking different languages. On Wednesday, the Paris-based AI lab released two new speech-to-text models: Voxtral Mini Transcribe V2, built to transcribe audio files in large batches, and Voxtral Realtime, built for near-real-time transcription within 200 milliseconds. Both can translate between 13 languages, and Voxtral Realtime is freely available under an open source license. At four billion parameters, the models are small enough to run locally on a phone or laptop -- a first in the speech-to-text field, Mistral claims -- meaning that private conversations needn't be dispatched to the cloud. According to Mistral, the new models are both cheaper to run and less error-prone than competing alternatives.

Mistral has pitched Voxtral Realtime -- though the model outputs text, not speech -- as a marked step toward free-flowing conversation across the language barrier, a problem Apple and Google are also competing to solve. The latest model from Google translates at a two-second delay. "What we are building is a system to be able to seamlessly translate. This model is basically laying the groundwork for that," says Pierre Stock, VP of Science Operations at Mistral, in an interview with WIRED. "I think this problem will be solved in 2026."

Founded in 2023 by Meta and Google DeepMind alumni, Mistral is one of the few European companies developing foundational AI models that come anywhere close to the American market leaders -- OpenAI, Anthropic, and Google -- in capability. Without access to the same level of funding and compute, Mistral has focused on eking out performance through imaginative model design and careful optimization of training datasets. The aim is for micro-improvements across all aspects of model development to translate into material performance gains. "Frankly, too many GPUs makes you lazy," says Stock. "You just blindly test a lot of things, but you don't think what's the shortest path to success."

Mistral's flagship large language model (LLM) does not match the raw capability of models developed by its US competitors. But the company has carved out a market by striking a compromise between price and performance. "Mistral offers an alternative that is more cost efficient, where the models are not as big, but they're good enough, and they can be shared openly," says Annabelle Gawer, director at the Centre of Digital Economy at the University of Surrey. "It might not be a Formula One car, but it's a very efficient family car."

Meanwhile, as its American counterparts throw hundreds of billions of dollars at the race to artificial general intelligence, Mistral is building a roster of specialist -- albeit less sexy -- models meant to perform narrow tasks, like converting speech into text. "Mistral does not position itself as a niche player, but it is certainly creating specialized models," says Gawer. "As a US player with resources, you want to have a very powerful general-purpose technology. You don't want to waste your resources fine-tuning it to the languages and specificities of certain sectors or geographies. You leave this kind of less profitable business on the table, which creates room for middle players."

As the relationship between the US and its European allies shows signs of deterioration, Mistral has leaned increasingly into its European roots, too.
"There is a trend in Europe where companies and in particular governments are looking very carefully at their dependency on US software and AI firms," says Dan Bieler, principal analyst at IT consulting firm PAC. Against that backdrop, Mistral has positioned itself as the safest pair of hands: a European-native, multilingual, open source alternative to the proprietary models developed in the US. "Their question has always been: How do we build a defensible position in a market that is dominated by hugely financed American actors?" says Raphaëlle D'Ornano, founder of tech advisory firm D'Ornano + Co. "The approach Mistral has taken so far is that they want to be the sovereign alternative, compliant with all the regulations that may exist within the EU." Though the performance gap to the American heavyweights will remain, as businesses contend with the need to find a return on AI investment and factor in the geopolitical context, smaller models tuned to industry- and region-specific requirements will have their day, Bieler predicts. "The LLMs are the giants dominating the discussions, but I wouldn't count on this being the situation forever," claims Bieler. "Small and more regionally focused models will play a much larger role going forward."
[2]
These New AI Transcription Models Are Built for Speed and Privacy
Sometimes you want to transcribe something but don't want it hanging out on the internet for any hacker to see. Maybe it's a conversation with your doctor or lawyer. Maybe you're a journalist and it's a sensitive interview. Privacy and control are important.

That desire for privacy is one reason the French developer Mistral AI built its latest transcription models to be small enough to run on devices. They can run on your phone, on your laptop or in the cloud. Voxtral Mini Transcribe 2, one of the new models announced Wednesday, is "super, super small," Pierre Stock, Mistral's vice president of science operations, told me. Another new model, Voxtral Realtime, can do the same thing but live, like closed captioning.

Privacy is not the only reason the company wanted to build small open-source models. By running right on the device you're using, these models can work faster. No more waiting on files to find their way through the internet to a data center and back. "What you want is the transcription to happen super, super close to you," Stock said. "And the closest we can find to you is any edge device, so a laptop, a phone, a wearable like a smartwatch, for instance."

The low latency (read: high speed) is especially important for real-time transcription. The Voxtral Realtime model can generate text with a latency of less than 200 milliseconds, Stock said. It can transcribe a speaker's words about as quickly as you can read them. No more waiting two or three seconds for the closed captioning to catch up.

The Voxtral Realtime model is available through Mistral's API and on Hugging Face, along with a demo where you can try it out. In some brief testing, I found it generated text fairly quickly (although not as fast as you'd expect if it were on device) and managed to capture what I said accurately in English with a little bit of Spanish mixed in. It's capable of handling 13 languages right now, according to Mistral.

Voxtral Mini Transcribe 2 is also available through the company's API, or you can play around with it in Mistral's AI Studio. I used the model to transcribe my interview with Stock. I found it quick and pretty reliable, although it struggled with proper names like Mistral AI (which it called Mr. Lay Eye) and Voxtral (VoxTroll). Yes, the AI model got its own name wrong. But Stock said users can customize the model to understand certain words, names and jargon better if they're using it for specific tasks.

The challenge of building small, fast AI models is that they also have to be accurate, Stock said. The company touted the models' performance on benchmarks showing improved error rates compared to competitors. "It's not enough to say, OK, I'll make a small model," Stock said. "What you need is a small model that has the same quality as larger models, right?"
[3]
Mistral drops Voxtral Transcribe 2, an open-source speech model that runs on-device for pennies
Mistral AI, the Paris-based startup positioning itself as Europe's answer to OpenAI, released a pair of speech-to-text models on Wednesday that the company says can transcribe audio faster, more accurately, and far more cheaply than anything else on the market -- all while running entirely on a smartphone or laptop.

The announcement marks the latest salvo in an increasingly competitive battle over voice AI, a technology that enterprise customers see as essential for everything from automated customer service to real-time translation. But unlike offerings from American tech giants, Mistral's new Voxtral Transcribe 2 models are designed to process sensitive audio without ever transmitting it to remote servers -- a feature that could prove decisive for companies in regulated industries like healthcare, finance, and defense.

"You'd like your voice and the transcription of your voice to stay close to where you are, meaning you want it to happen on device -- on a laptop, a phone, or a smartwatch," Pierre Stock, Mistral's vice president of science operations, said in an interview with VentureBeat. "We make that possible because the model is only 4 billion parameters. It's small enough to fit almost anywhere."

Mistral released two distinct models under the Voxtral Transcribe 2 banner, each engineered for different use cases. The Realtime model ships under an Apache 2.0 open-source license, meaning developers can download the model weights from Hugging Face, modify them, and deploy them without paying Mistral a licensing fee. For companies that prefer not to run their own infrastructure, API access costs $0.006 per minute. Stock said Mistral is betting on the open-source community to expand the model's reach. "The open-source community is very imaginative when it comes to applications," he said. "We're excited to see what they're going to do."

The decision to engineer models small enough to run locally reflects a calculation about where the enterprise market is heading. As companies integrate AI into ever more sensitive workflows -- transcribing medical consultations, financial advisory calls, legal depositions -- the question of where that data travels has become a dealbreaker. Stock painted a vivid picture of the problem during his interview. Current note-taking applications with audio capabilities, he explained, often pick up ambient noise in problematic ways: "It might pick up the lyrics of the music in the background. It might pick up another conversation. It might hallucinate from a background noise." Mistral invested heavily in training data curation and model architecture to address these issues. "All of that, we spend a lot of time ironing out the data and the way we train the model to robustify it," Stock said.

The company also added enterprise-specific features that its American competitors have been slower to implement. Context biasing allows customers to upload a list of specialized terminology -- medical jargon, proprietary product names, industry acronyms -- and the model will automatically favor those terms when transcribing ambiguous audio. Unlike fine-tuning, which requires retraining the model, context biasing works through a simple API parameter. "You only need a text list," Stock explained. "And then the model will automatically bias the transcription toward these acronyms or these weird words. And it's zero shots, no need for retraining, no need for weird stuff." Stock described two scenarios that capture how Mistral envisions the technology being deployed.
The first involves industrial auditing. Imagine technicians walking through a manufacturing facility, inspecting heavy machinery while shouting observations over the din of factory noise. "In the end, imagine like a perfect timestamped notes identifying who said what -- so diarization -- while being super robust," Stock said. The challenge is handling what he called "weird technical language that no one is able to spell except these people."

The second scenario targets customer service operations. When a caller contacts a support center, Voxtral Realtime can transcribe the conversation in real time, feeding text to backend systems that pull up relevant customer records before the caller finishes explaining the problem. "The status will appear for the operator on the screen before the customer stops the sentence and stops complaining," Stock explained. "Which means you can just interact and say, 'Okay, I can see the status. Let me correct the address and send back the shipment.'" He estimated this could reduce typical customer service interactions from multiple back-and-forth exchanges to just two: the customer explains the problem, and the agent resolves it immediately.

For all the focus on transcription, Stock made clear that Mistral views these models as foundational technology for a more ambitious goal: real-time speech-to-speech translation that feels natural. "Maybe the end goal application and what the model is laying the groundwork for is live translation," he said. "I speak French, you speak English. It's key to have minimal latency, because otherwise you don't build empathy. Your face is not out of sync with what you said one second ago."

That goal puts Mistral in direct competition with Apple and Google, both of which have been racing to solve the same problem. Google's latest translation model operates at a two-second delay -- ten times slower than what Mistral claims for Voxtral Realtime.

Mistral occupies an unusual position in the AI landscape. Founded in 2023 by alumni of Meta and Google DeepMind, the company has raised over $2 billion and now carries a valuation of approximately $13.6 billion. Yet it operates with a fraction of the compute resources available to American hyperscalers -- and has built its strategy around efficiency rather than brute force. "The models we release are enterprise grade, industry leading, efficient -- in particular, in terms of cost -- can be embedded into the edge, unlocks privacy, unlocks control, transparency," Stock said.

That approach has resonated particularly with European customers wary of dependence on American technology. In January, France's Ministry of the Armed Forces signed a framework agreement giving the country's military access to Mistral's AI models -- a deal that explicitly requires deployment on French-controlled infrastructure. "I think a big barrier to adoption of voice AI is that, hey, if you're in a sensitive industry like finance or in manufacturing or healthcare or insurance, you can't have information you're talking about just go to the cloud," noted Howard Cohen, who participated in the interview alongside Stock. "It needs to be either on device or needs to be on your premise."

The transcription market has grown fiercely competitive. OpenAI's Whisper model has become something of an industry standard, available both through an API and as downloadable open-source weights. Google, Amazon, and Microsoft all offer enterprise-grade speech services.
Specialized players like Assembly AI and Deepgram have built substantial businesses serving developers who need reliable, scalable transcription. Mistral claims its new models outperform all of them on accuracy benchmarks while undercutting them on price. "We are better than them on the benchmarks," Stock said. Independent verification of those claims will take time, but the company points to performance on FLEURS, a widely used multilingual speech benchmark, where Voxtral models achieve word error rates competitive with or superior to alternatives from OpenAI and Google.

Perhaps more significantly, Mistral's CEO Arthur Mensch has warned that American AI companies face pressure from an unexpected direction. Speaking at the World Economic Forum in Davos last month, Mensch dismissed the notion that Chinese AI lags behind the West as "a fairy tale." "The capabilities of China's open-source technology is probably stressing the CEOs in the US," he said.

Stock predicted that 2026 would be "the year of note-taking" -- the moment when AI transcription becomes reliable enough that users trust it completely. "You need to trust the model, and the model basically cannot make any mistake, otherwise you would just lose trust in the product and stop using it," he said. "The threshold is super, super hard."

Whether Mistral has crossed that threshold remains to be seen. Enterprise customers will be the ultimate judges, and they tend to move slowly, testing claims against reality before committing budgets and workflows to new technology. The audio playground in Mistral Studio, where developers can test Voxtral Transcribe 2 with their own files, went live today.

But Stock's broader argument deserves attention. In a market where American giants compete by throwing billions of dollars at ever-larger models, Mistral is making a different wager: that in the age of AI, smaller and local might beat bigger and distant. For the executives who spend their days worrying about data sovereignty, regulatory compliance, and vendor lock-in, that pitch may prove more compelling than any benchmark. The race to dominate enterprise voice AI is no longer just about who builds the most powerful model. It's about who builds the model you're willing to let listen.
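A brief aside on the metric behind those comparisons: word error rate is the edit distance between a reference transcript and a model's output, normalized by the length of the reference. A minimal sketch using the open-source jiwer library, with invented transcripts for illustration:

```python
# Word error rate (WER): edit distance between a reference transcript and
# a model's output, divided by the number of reference words. Uses the
# open-source jiwer library; both transcripts are invented for this example.
import jiwer

reference = "mistral released two new speech to text models on wednesday"
hypothesis = "mistral released two new speech text models on a wednesday"

# One deletion ("to") and one insertion ("a") across 10 reference words.
print(f"WER: {jiwer.wer(reference, hypothesis):.2f}")  # -> 0.20
```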
Paris-based Mistral AI launched two new speech-to-text models that transcribe audio locally on phones and laptops without cloud transmission. Voxtral Realtime delivers transcription within 200 milliseconds across 13 languages, while Voxtral Mini Transcribe V2 handles batch processing. The 4-billion-parameter models cost just $0.006 per minute via API.
Mistral AI released two new AI transcription models on Wednesday that mark a shift in how voice technology balances speed, privacy, and cost. The Paris-based startup introduced Voxtral Mini Transcribe V2 for batch audio transcription and Voxtral Realtime for near-instantaneous transcription, both capable of handling 13 languages [1]. At just 4 billion parameters, these speech-to-text models are small enough to run locally on phones or laptops, a capability Mistral AI claims is a first in the field [1].
Source: VentureBeat
The Voxtral Realtime model operates with latency under 200 milliseconds, generating transcriptions nearly as quickly as someone can read them [2]. This ultra-fast translation capability positions Mistral to compete directly with tech giants like Google, whose latest model translates at a two-second delay [1]. Pierre Stock, VP of Science Operations at Mistral AI, told WIRED that the company is building toward seamless conversation across language barriers, predicting this challenge "will be solved in 2026" [1].

The ability to process audio locally addresses growing concerns about data sovereignty and privacy in sensitive contexts. By running on an edge device rather than transmitting data to remote servers, Voxtral keeps conversations, whether with doctors, lawyers, or journalists, from exposure to potential security breaches [2][3]. "You'd like your voice and the transcription of your voice to stay close to where you are," Stock explained to VentureBeat [3].
Source: CNET
This architecture proves particularly valuable for regulated industries like healthcare, finance, and defense, where data transmission rules can make cloud-based solutions impractical [3]. The compact design also delivers speed advantages: processing happens "super, super close to you" on devices like laptops, phones, or smartwatches, eliminating delays from internet transmission [2].
Voxtral Realtime ships under an Apache 2.0 open source license, allowing developers to download model weights from Hugging Face, modify them, and deploy without licensing fees [3]. For companies preferring managed infrastructure, API access costs just $0.006 per minute, dramatically cheaper than competing alternatives [3]. Mistral AI claims the new models are both more cost-efficient and less error-prone than existing options [1].
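To put the per-minute price in perspective, a quick back-of-the-envelope calculation; only the $0.006-per-minute rate comes from the sources, while the workload sizes are invented for illustration:

```python
# Back-of-the-envelope costs at the $0.006-per-minute API rate cited above.
# The workload sizes below are hypothetical, chosen only for illustration.
PRICE_PER_MINUTE_USD = 0.006

def transcription_cost(audio_minutes: float) -> float:
    """Estimated API cost in USD for a given amount of audio."""
    return audio_minutes * PRICE_PER_MINUTE_USD

# A one-hour interview: 60 * $0.006 = $0.36.
print(f"1-hour interview: ${transcription_cost(60):.2f}")
# A support center handling 10,000 hours of calls per month: $3,600.
print(f"10,000 hours/month: ${transcription_cost(10_000 * 60):,.2f}")
```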
The company added enterprise features like context biasing, which allows customers to upload specialized terminology through a simple API parameter without retraining the model [3]. "You only need a text list," Stock noted, "and then the model will automatically bias the transcription toward these acronyms or these weird words" [3]. This zero-shot capability addresses challenges in sectors with proprietary jargon, from medical consultations to industrial auditing.
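To make the "text list" idea concrete, here is a hypothetical sketch of what such a call might look like. Only the upload-a-term-list-via-an-API-parameter behavior is described in the sources; the endpoint path, model identifier, and field name in this sketch are assumptions:

```python
# Hypothetical sketch of context biasing over Mistral's transcription API.
# The sources only say customers upload "a text list" via "a simple API
# parameter"; the endpoint path, model id, and "context_bias" field name
# below are assumptions, not documented values.
import requests

API_KEY = "YOUR_MISTRAL_API_KEY"
ENDPOINT = "https://api.mistral.ai/v1/audio/transcriptions"  # assumed path

# Terms the model should favor when the audio is ambiguous -- the fix for
# mis-transcriptions like "Mr. Lay Eye" for "Mistral AI".
bias_terms = ["Mistral AI", "Voxtral", "diarization"]

with open("interview.mp3", "rb") as audio:
    response = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": audio},
        data={
            "model": "voxtral-mini-transcribe-v2",  # assumed model id
            "context_bias": ",".join(bias_terms),   # assumed field name
        },
        timeout=120,
    )
response.raise_for_status()
print(response.json()["text"])  # assumed response shape
```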
Founded in 2023 by Meta and Google DeepMind alumni, Mistral AI positions itself as Europe's answer to OpenAI, Anthropic, and Google [1]. Without access to comparable funding and compute resources, the company focuses on performance gains through careful optimization rather than brute-force scaling. "Frankly, too many GPUs makes you lazy," Stock claimed [1].

As US-European relations show strain, Mistral has leaned into its European roots as a multilingual, regulation-compliant alternative to proprietary American models [1]. Dan Bieler, principal analyst at PAC, notes companies and governments are scrutinizing dependency on US software and AI firms [1]. Annabelle Gawer, director at the Centre of Digital Economy at the University of Surrey, describes Mistral's approach: "It might not be a Formula One car, but it's a very efficient family car" [1].
Stock envisions Voxtral as foundational technology for natural real-time speech-to-speech translation [3]. Use cases span from customer service, where agents could resolve issues in two interactions instead of prolonged back-and-forth exchanges, to industrial settings where technicians shout observations over factory noise [3].
Source: Wired
Both models are available through Mistral's API and on Hugging Face, with a demo for testing Voxtral Realtime [2]. While the models handled English transcription accurately in testing, they struggled with proper names, including misspelling "Voxtral" itself, though Stock notes users can customize the model for specific terminology [2].
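For developers who want to go the open-weight route, a minimal sketch of pulling the model files from Hugging Face follows. The repository name is a placeholder, since the articles confirm availability on Hugging Face without naming the exact repo:

```python
# Minimal sketch of fetching open-weight model files from Hugging Face.
# The repo id is a placeholder: the articles confirm the Voxtral Realtime
# weights are on Hugging Face but do not name the exact repository.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="mistralai/Voxtral-Realtime")  # placeholder id
print(f"Model files downloaded to: {local_dir}")
```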
As businesses seek returns on AI investment while navigating geopolitical complexities, analysts predict smaller models tuned to regional and industry requirements will capture growing market share against the American heavyweights [1].

Summarized by Navi