Curated by THEOUTPOST
On Thu, 27 Feb, 8:03 AM UTC
4 Sources
[1]
ElevenLabs' new speech-to-text model claims 97% accuracy
ElevenLabs, an AI startup recognized for its audio-generation capabilities, has launched a stand-alone speech-to-text model named Scribe. The launch follows a $180 million funding round that raised the company's valuation to $3.3 billion.
Scribe supports over 99 languages and achieves a word error rate (WER) of less than 5% in over 25 of them, including English, for which ElevenLabs claims an accuracy rate of 97%. Other languages in this excellent accuracy category include French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese. The remaining languages fall into lower tiers, ranging from high accuracy (5% to 10% WER) to moderate accuracy (25% to 50% WER).
The new model reportedly outperforms Google's Gemini 2.0 Flash and OpenAI's Whisper Large v3 in multiple languages on the FLEURS and Common Voice benchmarks. Scribe is the first standalone speech-to-text model from ElevenLabs, which had previously integrated speech-to-text components into its AI conversational agent platform.
CEO Mati Staniszewski highlighted the goal of understanding conversations better: "We are working on ways to move away from only generating content and understanding and transcribing speech," he said.
The model features speaker diarization to differentiate speakers in multi-speaker recordings, word-level timestamps for accurate subtitles, and auto-tagging of non-verbal audio events. Scribe is currently limited to pre-recorded audio, with a real-time version expected soon. Scribe costs $0.40 per hour of transcribed audio, with an introductory 50% discount for the first six weeks. Benchmark tests indicate that Scribe records the lowest word error rates for various languages, reaching 98.7% accuracy in Italian and 96.7% in English.
For enterprise users, Scribe serves as a scalable transcription tool for sectors that rely on documentation, meeting transcription, and accessibility initiatives. The forthcoming real-time version could extend its usefulness to live communication scenarios. The launch of Scribe coincided with the release of Hume AI's Octave, a customizable, LLM-powered text-to-speech model tailored for content creation.
ElevenLabs claims Scribe has consistently outperformed competitors in transcription accuracy. Scribe can be accessed directly through the ElevenLabs website or API, where users upload audio or video files and receive formatted transcripts. Its structured output eases integration into other applications, making it a competitive option for businesses seeking high-accuracy transcription services.
[2]
ElevenLabs' new speech-to-text model Scribe is here with highest accuracy rate so far (96.7% for English)
ElevenLabs, the highly valued AI voice cloning and generation startup from Palantir alumni, today launched Scribe v1, a new speech-to-text model that reportedly achieves the highest accuracy across multiple languages. Users can try it on the ElevenLabs site.
According to the company's benchmarks, it outperforms Google's Gemini 2.0 Flash, OpenAI's Whisper v3, and Deepgram Nova-3 at converting speech into text, achieving new record-low error rates. The company claims that Scribe delivers state-of-the-art transcription accuracy in 99 languages, including improved performance in previously underserved languages such as Serbian, Cantonese, and Malayalam.
As Flavio Schneider, ElevenLabs Lead Researcher, wrote on X, Scribe is the "smartest audio understanding model" released by ElevenLabs yet. "Scribe doesn't just transcribe -- it understands audio," Schneider continued in a threaded reply. "It can detect non-verbal events (like laughter, sound effects, music, and background noise) and analyze long audio contexts for accurate diarization, even in the most challenging environments."
"Diarization" is the process of separating the speakers on a recording by their vocal qualities. ElevenLabs' documentation states that Scribe can distinguish and isolate up to 32 different speakers in the same audio file.
While ElevenLabs cautions that Scribe is "best used for when high-accuracy transcription is required rather than real-time transcription," the company also plans to introduce a low-latency version soon, expanding its use to real-time applications.
Lowest word error rates (WER)
Scribe is designed to handle real-world audio challenges with precision. According to benchmark results from FLEURS and Common Voice, it records the lowest word error rates (WER) for many languages, including Italian (98.7% accuracy) and English (96.7% accuracy).
Scribe is available now through the ElevenLabs website and API. Pricing is set at $0.40 per hour of input audio, with a 50% discount for the next six weeks. A low-latency version for real-time applications is also in development.
What it means for enterprises
For enterprise decision-makers, Scribe presents a tool for scalable, high-accuracy transcription, making it useful for industries relying on automated documentation, meeting transcription, and content accessibility. The model's ability to handle diverse languages with high precision also benefits multinational businesses, media companies, and customer support applications.
Scribe's pricing structure makes it competitive for businesses that require high-volume transcription services, and its API-based integration allows for adoption in enterprise workflows. Additionally, the upcoming low-latency version could position Scribe as a viable option for real-time communication tools.
Coming the same day as rival Hume's opposite text-to-speech model Octave
Timing is everything, and ElevenLabs chose to launch Scribe the same day that rival Hume AI unveiled Octave, an LLM-powered text-to-speech model that allows users to customize AI-generated voices with adjustable emotions. It is designed for content creation, including audiobooks, podcasts, and video game voiceovers. Unlike standard TTS systems, Octave considers context beyond individual sentences, adjusting tone, rhythm, and cadence dynamically to sound more natural.
Hume AI positions Octave as a direct competitor to ElevenLabs' text-to-speech offerings, highlighting that Octave's pricing is about half the cost of ElevenLabs' current AI voice services.
While Scribe and Octave serve different functions, their development reflects the growing competition in AI-driven audio models. ElevenLabs is prioritizing precise, multi-language speech recognition, while Hume AI is advancing expressive AI-generated speech. For enterprises, this means more specialized solutions for both transcription and synthetic voice applications, enabling more efficient content production, customer engagement, and accessibility tools.
Scribe is now live, and ElevenLabs is hosting a virtual event next week with the team behind its development. More details, benchmarks, and API documentation are available in the official blog post.
[3]
ElevenLabs Unveils Scribe, a Speech-to-Text Transcription Model to Rival Otter, TurboScribe, and Others
ElevenLabs claims to have launched the most accurate speech-to-text and transcription model on the market.
ElevenLabs has launched Scribe, a new speech-to-text tool that promises the highest accuracy in the field. The launch places the company among notable competitors like Google, Otter, Fireflies, and TurboScribe, all of which are established in speech-to-text technology.
ElevenLabs is popularly known for its text-to-speech and AI voice generation technologies. With Scribe, users get a product that does the opposite, built on the company's expertise in speech synthesis.
Scribe transcribes speech in 99 languages, with features like word-level timestamps, speaker diarisation, and audio-event tagging. The transcription is delivered as a structured response for easier integration.
On accuracy, ElevenLabs states that it tested Scribe on the FLEURS and Common Voice benchmarks across all supported languages and found that it consistently outperformed models like Gemini 2.0 Flash, Whisper Large V3, and Deepgram Nova-3.
"Whether it's meeting summaries, movie subtitles, or even song lyrics, Scribe delivers the lowest automated transcription word error rate in Italian (98.7%), English (96.7%), and 97 other languages," said ElevenLabs. The percentages in that statement are accuracy figures, not error rates. The company emphasises that its technology handles languages such as Serbian, Cantonese, and Malayalam with low word error rates.
Developers can integrate Scribe through the Speech-to-Text API to get structured JSON transcripts with non-speech event markers, speaker diarisation, and word-level timestamps.
Scribe is priced at $0.40 per hour of input audio, and for the next six weeks it comes with an introductory discount.
Creators and businesses can access Scribe directly via the ElevenLabs dashboard to upload audio or video files and generate formatted transcripts.
Currently, the offering focuses on higher accuracy. A low-latency version for real-time applications will be released soon, according to ElevenLabs.
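To illustrate why word-level timestamps matter for subtitles, here is a minimal Python sketch that turns a list of timed words into SRT subtitle cues. The input shape (a list of dictionaries with `text`, `start`, and `end` keys, times in seconds) is a hypothetical stand-in and not the actual Scribe response schema, which the coverage above does not describe; the function names and the `max_words` grouping rule are likewise assumptions made for the example.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Dict], max_words: int = 7) -> str:
    """Group timed words into numbered SRT cues of at most max_words words.

    Each word is a dict with hypothetical keys "text", "start", "end".
    A cue starts when its first word starts and ends when its last word ends.
    """
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        text = " ".join(word["text"] for word in chunk)
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hypothetical transcript fragment, for illustration only.
sample_words = [
    {"text": "Hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 0.9},
]
print(words_to_srt(sample_words))
```

A real integration would also need to break cues at sentence boundaries or speaker changes; fixed-size word groups are used here only to keep the sketch short.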
[4]
ElevenLabs is launching its own speech-to-text model | TechCrunch
ElevenLabs, an AI startup that just raised a $180 million mega funding round, has been primarily known for its audio generation prowess. The company took a step in another technological direction by launching its first standalone speech-to-text model, called Scribe.
The startup, valued at $3.3 billion, has helped many other companies provide text-to-speech services through its vast library of voices. However, the company is now looking to get into speech detection and compete with the likes of Gladia, Speechmatics, AssemblyAI, Deepgram, and OpenAI's Whisper models.
ElevenLabs' Scribe model supports over 99 languages at launch. The company places over 25 languages in the model's excellent accuracy category, where the word error rate is less than 5%. This list includes English (claimed accuracy rate of 97%), French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese.
Other languages are ranked in different categories: high (5% to 10% word error rate), good (10% to 20% word error rate), and moderate (25% to 50% word error rate).
The company said that the model outperformed Google Gemini 2.0 Flash and Whisper Large V3 across multiple languages in FLEURS and Common Voice benchmark tests.
ElevenLabs had developed the speech-to-text component for its AI conversational agent platform, which was released last year. However, this is the first time the company is releasing a standalone speech detection model.
In a conversation with TechCrunch last month, CEO Mati Staniszewski talked about improving speech detection models.
"We want to understand what's being said by you in a conversation better. We are working on ways to move away from only generating content and understanding and transcribing speech," Staniszewski said at that time.
"Many people say that speech-to-text is a solved problem. But for many languages, it is pretty bad. We think we can build better speech detection models because we have in-house teams to annotate data and give us quick feedback."
The model also has smart speaker diarization to tell you who is speaking, word-level timestamps for accurate subtitles, and auto-tagging of sound events like audience laughter. The startup also lets customers transcribe video content directly in its studio to add subtitles or captions.
Scribe currently only works with pre-recorded audio formats. The company said it will release a low-latency real-time version of the model soon. That means it is not yet effective for meeting transcriptions or voice note-taking.
ElevenLabs is pricing Scribe at $0.40 for an hour of transcribed audio. While the rate is competitive, some of its rivals currently offer a lower price for audio transcriptions, with some feature differentiation.
ElevenLabs, an AI startup valued at $3.3 billion, has introduced Scribe, a new speech-to-text model claiming 97% accuracy in English and support for over 99 languages, positioning itself as a strong competitor in the AI transcription market.
ElevenLabs, an AI startup known for its audio generation capabilities, has launched Scribe, a standalone speech-to-text model that claims to set new standards in transcription accuracy. This move comes on the heels of a substantial $180 million funding round that valued the company at $3.3 billion [1][2].
Scribe supports over 99 languages, with a word error rate of less than 5% in more than 25 of them. ElevenLabs claims a 97% accuracy rate for English, while Italian reaches 98.7% accuracy in the company's benchmarks [1][2][3]. Other languages in the excellent accuracy category (below 5% word error rate) include French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese [4].
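The reports quote both word error rates and accuracy percentages, so it helps to see how the two relate. The Python sketch below is illustrative only: it computes WER in the usual way, as the word-level edit distance between a reference transcript and the model's output divided by the number of reference words, and treats accuracy as 1 minus WER. The sources do not say how ElevenLabs normalizes text or derives its accuracy figures, so the lowercasing and whitespace tokenization here are assumptions, and the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_from_wer(wer: float) -> float:
    """Accuracy under the common convention accuracy = 1 - WER."""
    return 1.0 - wer


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {wer:.3f}, accuracy: {accuracy_from_wer(wer):.3f}")
```

Under this convention, the 96.7% English accuracy figure corresponds to a WER of about 3.3%, which is consistent with the reported sub-5% tier.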
According to ElevenLabs' benchmarks, Scribe has outperformed notable competitors such as Google's Gemini 2.0 Flash, OpenAI's Whisper Large v3, and Deepgram Nova-3 in FLEURS and Common Voice benchmark tests across multiple languages [2][3]. This positions ElevenLabs as a formidable player in the speech-to-text market, challenging established names like Otter, Fireflies, and TurboScribe [3].
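Scribe offers several features that set it apart: speaker diarization that can distinguish up to 32 speakers in a single file, word-level timestamps for accurate subtitles, and automatic tagging of non-speech audio events such as laughter, music, and background noise [2][3][4].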
ElevenLabs has priced Scribe competitively at $0.40 per hour of transcribed audio, with an introductory 50% discount for the first six weeks [1][3]. The model is accessible through the ElevenLabs website and API, allowing users to upload audio or video files for formatted transcripts [1][3].
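As a rough guide to what that pricing means at volume, the following Python sketch estimates the cost of a transcription job at the reported list price and at the introductory rate. It is back-of-the-envelope arithmetic only: the sources do not describe billing granularity, rounding, or volume tiers, so linear per-hour billing rounded to the cent is an assumption, and the names used are hypothetical.

```python
LIST_PRICE_PER_HOUR = 0.40  # USD per hour of audio, as reported
INTRO_DISCOUNT = 0.50       # 50% introductory discount, as reported


def transcription_cost(audio_hours: float, discounted: bool = False) -> float:
    """Estimated cost in USD, assuming simple linear per-hour billing."""
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    rate = LIST_PRICE_PER_HOUR
    if discounted:
        rate *= 1 - INTRO_DISCOUNT
    return round(audio_hours * rate, 2)


# A hypothetical archive of 1,000 hours of recordings.
print(transcription_cost(1000))                   # list price
print(transcription_cost(1000, discounted=True))  # introductory rate
```

At these rates, 1,000 hours of audio comes to $400 at list price and $200 during the discount period.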
While Scribe currently focuses on pre-recorded audio for high-accuracy transcription, ElevenLabs has announced plans to release a low-latency version for real-time applications in the near future [2][4]. This development could significantly expand Scribe's utility in live communication scenarios and further disrupt the market.
For businesses, Scribe presents a powerful tool for scalable, high-accuracy transcription. Its multi-language support and advanced features make it particularly valuable for multinational corporations, media companies, and customer support applications [2]. The competitive pricing and API-based integration also position Scribe as an attractive option for enterprises requiring high-volume transcription services [2].
As the AI-driven audio model market continues to evolve, ElevenLabs' launch of Scribe, alongside developments from competitors like Hume AI's Octave, signals a new era of specialized solutions for both transcription and synthetic voice applications [2]. This progression promises to enhance content production, customer engagement, and accessibility tools across various industries.
ElevenLabs, a leading AI voice technology company, has raised $180 million in Series C funding, tripling its valuation to $3.3 billion. The company plans to use the funds to enhance its voice AI research, expand globally, and develop new products for digital interactions.
9 Sources
OpenAI introduces new AI models for speech-to-text and text-to-speech, offering improved accuracy, customization, and potential for building AI agents with voice capabilities.
7 Sources
ElevenLabs introduces a free AI-powered audiobook publishing tool, ElevenReader Publishing, allowing authors to easily convert books into audiobooks. The company also partners with Spotify to distribute AI-narrated audiobooks, potentially revolutionizing the audiobook industry.
12 Sources
Hume AI launches Octave, an innovative text-to-speech system powered by a large language model, capable of generating contextually aware and emotionally nuanced speech for various applications.
5 Sources
Gladia, a French AI startup, has secured $16 million in Series A funding to develop an advanced multilingual real-time audio transcription and analytics engine, aiming to revolutionize voice-first platforms across various industries.
4 Sources
© 2025 TheOutpost.AI All rights reserved