Curated by THEOUTPOST
On Wed, 26 Feb, 4:07 PM UTC
5 Sources
[1]
Hume AI just unveiled Octave -- new AI voice generator is eerily human
Hume AI today unveiled Octave, a text-to-speech (TTS) system that uses large language model (LLM) technology to generate contextually aware and emotionally nuanced speech. Its remarkably human-like output positions Octave as a competitor for the lead in AI-driven voice synthesis. Traditional TTS systems often ignore context, which leads to monotonous output. Octave instead interprets the context of the text and adds emotional undertones, adjusting tone, rhythm, and cadence accordingly. The result is speech that is more lifelike and engaging. For instance, Octave can interpret a sarcastic remark and deliver it with the appropriate intonation or convey urgency in a panicked sentence without explicit direction. One of Octave's standout features is its Voice Design capability. Users can create unique AI voices by providing descriptive prompts that specify characteristics such as accent, age, gender, and emotional tone. For example, prompting Octave with "a dramatic medieval knight" will generate a voice that embodies that persona. This functionality gives creators considerable flexibility in tailoring voices to fit specific narratives or character profiles. In an internal blind comparison study performed by Hume AI and not released to the public, 180 human raters favored Octave's outputs over those from ElevenLabs in terms of audio quality (71.6%), naturalness (51.7%), and alignment with desired voice descriptions (57.7%) across 120 diverse prompts. According to Hume AI, these results indicate that Octave can produce high-quality, natural-sounding speech that reflects user specifications. Octave's capabilities have implications across various industries. Content creators can use Octave to generate dynamic voiceovers for audiobooks, podcasts, and videos, enhancing listener engagement through expressive narration. 
In gaming, developers can craft immersive character dialogues that adapt to in-game contexts and player interactions. Additionally, Octave's potential extends to virtual assistants and customer service bots, enabling them to respond with appropriate emotional nuances, thereby improving user experience and satisfaction. While Octave represents a significant technological advancement, it also raises important ethical considerations. The ability to generate highly realistic and emotionally resonant speech necessitates responsible use to prevent potential misuse, such as deepfake audio or deceptive impersonations. Hume AI acknowledges these concerns and emphasizes the importance of implementing safeguards and ethical guidelines to ensure that Octave's deployment aligns with societal values and trust. Hume AI's Octave sets a new standard in text-to-speech technology by combining large language model intelligence with sophisticated voice synthesis. Its ability to understand and convey context and emotion opens new avenues for creating authentic and engaging auditory experiences across multiple domains. As AI continues to evolve, innovations like Octave highlight the potential for technology to bridge the gap between human expression and machine-generated communication.
[2]
Hume launches new text-to-speech model Octave that generates custom AI voices with adjustable emotions
New York City startup Hume AI emerged from stealth two years ago and has since raised multimillions in funding on the basis of its technology that creates emotive AI voices for use in enterprise applications. Today, it is taking its offerings a step further with a new large-language and speech model called the "Omni-capable text and voice engine," or Octave for short, designed to produce lifelike, emotionally nuanced speech for use across different forms of content, from audiobooks to prerecorded video game character dialog and film/TV/video. Hume claims Octave is the first text-to-speech system powered by a large language model (LLM) trained not only on text but on speech and emotion tokens, enabling it to understand words in context and adjust tone, rhythm, and cadence accordingly -- all of which the user can adjust at the sentence level with text prompts. "We're launching the first LLM for text-to-speech -- a model that understands words in context, predicting the right emotions, rhythm, cadence, and emphasis, making speech sound more human than ever before," said Alan Cowen, Hume AI's co-founder and CEO, in a video call interview with VentureBeat. Octave's capabilities go beyond basic voice generation. It can interpret character traits and style from a script alone, adjusting vocal inflections to match implied emotions. A sarcastic remark will be spoken sarcastically, a panicked sentence will sound urgent, and a whispered secret will be hushed -- all without needing explicit direction. In addition, if the user doesn't like the generated voice or wants to adjust it, they can do so granularly through natural language by simply typing in a text instruction to Octave, such as "happier, sadder, more frustrated, angrier, more sarcastic, more sincere," etc. 
"You can describe a character -- like a sarcastic medieval peasant -- and the model will instantly create that voice, adjusting emotions like anger, sadness, or happiness based on your instructions," Cowen added. "Voice modulation works at the sentence level, but you can also adjust parts of a sentence, instructing the model to convey nuanced emotions like slight frustration mixed with humor or exasperation." The model also considers context beyond individual sentences. "Unlike traditional models that process text word by word, our model considers entire paragraphs, capturing context to deliver more natural and emotionally accurate speech," he explained. While the current release focuses on English-language speech, Octave also supports Spanish and is expected to expand its language capabilities in the near future.
Tailored for content creation
Octave is tailored for content creators and media production, offering applications in audiobooks, podcasts, video game characters, and video voiceovers. "This new model is designed for offline text-to-speech -- perfect for audiobooks, podcasts, video voiceovers, and video game characters -- where creators need realistic, character-specific voices," Cowen explained. However, the user must access it through Hume's website, either on its Projects page or through an application programming interface (API). The "offline" component refers to the fact that this model is designed to produce discrete audio files that can be added to projects such as videos or audiobooks. It's not designed to carry on realtime conversation, though that could theoretically be achieved by piping text queries to the website. Hume's API allows developers to make up to 50 requests to the new Octave model per minute, with a maximum text length of 5,000 characters and descriptions capped at 1,000 characters. Each request can generate up to five outputs, and the supported audio formats include MP3, WAV, and PCM. 
Hume's prior EVI series of models, which allow for streaming, realtime, back-and-forth interactions, remain available and will continue to be developed. Hume AI offers a subscription-based pricing model with tiers ranging from a free option to Creator, Creator Pro, and Enterprise plans. Hume emphasized that its Octave TTS pricing is around half the cost of competing AI voice creation startup ElevenLabs, showing the intensifying competition in the text-to-speech space. In addition, Hume AI conducted a blind comparison study with 180 human raters to benchmark Octave against ElevenLabs. The results showed that Octave was preferred in terms of audio quality (71.6% of trials), naturalness (51.7% of trials), and how well the speech matched descriptions of the desired voice (57.7% of trials), across 120 diverse prompts. To further evaluate its performance, Hume AI has also launched the Expressive TTS Arena, a public benchmark designed to test how well AI models handle longer, expressive speech -- an area that previous TTS benchmarks have largely overlooked.
10s of trillions of language tokens
Unlike traditional text-to-speech systems that rely on limited speech datasets, Octave TTS is built on an LLM trained on tens of trillions of language tokens. "Traditional text-to-speech models are trained on limited speech data, but ours is built on an LLM trained on tens of trillions of tokens, enabling it to reason, think, and infer emotions from text," Cowen said. The model was trained using millions of hours of public, long-form speech data and Hume AI's proprietary datasets of new voices recorded by survey participants. "We collected data from people recording themselves through webcams, reacting naturally to videos, telling stories, and talking to others, including friends and family, to capture a wide range of emotional expressions," Cowen said. 
This extensive training allows the model to infer emotional context and follow detailed instructions, creating voices that match specific character descriptions and attributes.
Consistent character voices and limitations
Octave TTS maintains consistent character voices across long-form content. "With our platform, you can generate unique voices for each character in an audiobook -- like a middle-aged orc -- and maintain that character's voice throughout the story," Cowen said. This capability is supported by Hume AI's "Projects" page, which handles long-form content like audiobooks by automatically chunking text while preserving character consistency and context across chapters. Hume has technical guardrails built into its website and API prohibiting the creation of realistic children's voices and imitations of specific individuals, but other than that, it is open to use across a wide range of content and subject matter, including potentially not-safe-for-work scenes such as those in popular romance novels. "We give developers freedom, allowing content across a broad range of human experiences, though we restrict the creation of realistic children's voices and imitations of specific individuals," Cowen explained. In addition, Cowen said that the company could adjust these guardrails for specific clients upon request, such as a children's book publisher looking to create voices for children's audiobooks. Hume AI is working on a forthcoming Voice Cloning feature, which will allow users to replicate a voice from as little as five seconds of audio. The company is developing safeguards to ensure ethical use before rolling out the feature publicly. With its combination of contextual awareness, emotional expression, and character customization, Octave TTS aims to provide content creators with more control and flexibility, delivering voices that sound both realistic and emotionally engaging.
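As a rough illustration of the limits reported above (50 requests per minute, 5,000 characters of text, 1,000 characters of description, up to five outputs, and MP3, WAV, or PCM audio), the following Python sketch checks a request against those figures and splits a long manuscript into sentence-aligned chunks that fit the text limit. It does not call Hume's API. The `TtsRequest` class and the `validate_request` and `chunk_text` functions are hypothetical names invented for this example, and the article does not describe how the Projects page actually chunks text, so the greedy sentence packing here is only an assumption.

```python
import re
from dataclasses import dataclass

# Limits as reported in the article; every name below is hypothetical.
MAX_TEXT_CHARS = 5_000
MAX_DESCRIPTION_CHARS = 1_000
MAX_OUTPUTS = 5
MAX_REQUESTS_PER_MINUTE = 50
SUPPORTED_FORMATS = {"mp3", "wav", "pcm"}

# Spacing requests evenly by this interval keeps a client under the cap.
MIN_SECONDS_BETWEEN_REQUESTS = 60 / MAX_REQUESTS_PER_MINUTE


@dataclass
class TtsRequest:
    """Illustrative request shape, not Hume's actual schema."""
    text: str
    description: str = ""
    num_outputs: int = 1
    audio_format: str = "mp3"


def validate_request(req: TtsRequest) -> list[str]:
    """Return a list of problems; an empty list means within the limits."""
    problems = []
    if len(req.text) > MAX_TEXT_CHARS:
        problems.append("text exceeds 5,000 characters")
    if len(req.description) > MAX_DESCRIPTION_CHARS:
        problems.append("description exceeds 1,000 characters")
    if not 1 <= req.num_outputs <= MAX_OUTPUTS:
        problems.append("num_outputs must be between 1 and 5")
    if req.audio_format.lower() not in SUPPORTED_FORMATS:
        problems.append("audio_format must be mp3, wav, or pcm")
    return problems


def chunk_text(text: str, max_chars: int = MAX_TEXT_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is split mid-sentence.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

In this sketch, a client would run `chunk_text` over a chapter, wrap each chunk in a `TtsRequest` that reuses the same character description, and wait at least `MIN_SECONDS_BETWEEN_REQUESTS` between submissions. Reusing one description per character is only a stand-in for the voice consistency the Projects page is said to provide.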
[3]
Hume's Octave Claims to Outperform ElevenLabs in Capturing Human-Like Emotions in AI Voices
The speech-language model can predict the tune, rhythm, and timbre of speech.
Octave, short for Omni-Capable Text and Voice Engine, is an LLM developed by Hume AI and tailored for text-to-speech tasks. The launch comes at a time when ElevenLabs has introduced its new speech-to-text technology, Scribe. The company explained that the model not only reads words but also understands their context, which enables it to enhance AI voice capabilities. It generates voices from prompts, acts out characters, and takes instructions to tweak emotions and style. The speech-language model can predict the tune, rhythm, and timbre of speech. It can also detect plot twists, emotional cues, and character traits from the script or prompt. The prompts can be nuanced, like requesting a "patient, empathetic counsellor with an ASMR voice", allowing for highly specific tonalities. Furthermore, the platform's 'Action Instructions' feature lets users tweak the emotion or style of an existing voice, such as asking it to "sound sarcastic". Hume recently organised a blind comparison study with 180 human raters. In the study, Octave's outputs were favoured over those generated by ElevenLabs' Voice Design in several key aspects. Notably, Octave outperformed in audio quality (71.6%), naturalness (51.7%), and in how well the speech matched the intended prompt (57.7%) across a diverse set of 120 prompts. While the voice cloning feature is not currently available, the company said it will be soon. The feature will allow users to clone a voice from as little as five seconds of audio. Octave is available on Hume's official portal and through its API. Users can also access a voice library of over 40 premade voices and try out its project interface, which is in preview, to generate long-form content like audiobooks and podcasts. The model is presently focused on English-language speech but can also speak Spanish. The company plans to improve its capabilities for other languages soon. 
In addition to Octave, Hume AI has also introduced the Expressive TTS Arena, a public evaluation platform inspired by Hugging Face's TTS Arena.
[4]
Hume launches text-to-speech model Octave that generates emotive, adjustable AI voices on-demand based on your prompts
[5]
This new text-to-speech AI model understands what it's saying - how to try it for free
I tested Hume's new Octave model and was impressed with the results. Now you can try it, too. Text-to-speech AI models are a great tool for instances where human voice actors are typically used, such as audiobooks, dubbing, commercials, and more. However, because these models are not human and are unaware of what they say, they can sometimes sound noticeably robotic. Hume's new AI model seeks to tackle this issue. On Wednesday, Hume launched Octave, a text-to-speech large language model (LLM) with contextual awareness. The LLM can use this awareness to adjust the tune, rhythm, and timbre of its speech to the words it is reading based on their meaning, according to the company. For example, an AI-enabled voice can convey a sense of disgust when reading a sentence. Beyond understanding the context of the text, the model can also take directions. Users can instruct it to be "calm", "whispering", "disgustful", "angry", and more. Hume says the advantage Octave has over a voice actor is that it can take on any voice or even invent a new one based on the user's description. For instance, Hume says a user could provide a prompt as simple as "wise wizard" or as complex as combining different accents, demographic groups, occupational roles, and more. Essentially, the model can invent a voice based on the script alone, but when given a description, it is steered by both the script and the description. The user interface is easy to navigate, with one text box for Voice, in which you can describe exactly what you want the voice to sound like, and another for Script, in which you enter what you want the model to say. For my first test, I used the detailed pre-made prompts to see how it sounded. After clicking on "Generate", Octave generated three voice results, and upon first listen I was impressed. 
Although I wasn't convinced that the generations captured the "valley girl" sound, I was super-impressed with the intonations and inflections. For my own prompt, I created a scenario where the primary speaker is out of breath from running and in a hurry. The script read: "YAY I am almost at the finish line. I am so tired but am going to keep pushing because I am almost there. See you later! Byeeee." I was equally happy with these results. Octave mostly conveyed what I wanted, placing the right amount of excitement and pauses where breaths would be taken if you were exhausted from running. However, like the prior example, the voice wasn't exactly what I described. In this case, the speaker didn't speak super-fast. Overall, it seems like the model's strength is placing the nuances of human speech in its output. What often gives AI voices away is their monotony, which makes the output quite boring to listen to. With Octave, you could hear the reader's emotions, whether frustration, defeat, or tiredness. Words like "ugh" have the exact length and breathing a human would use, creating an engaging experience. There are different tiers for accessing the model, including a free one with a 10,000-character limit (around 10 minutes) and unlimited character voices if you want to try it out. Beyond the free tier, there are six additional tiers, ranging from $3 to $900 per month, depending on access needs. For example, the Starter tier is $3 per month and includes 30,000 characters (around 30 minutes), while the Business tier is $900 monthly for 10,000,000 characters (around 10,000 minutes). There is also an Enterprise option that can be customized to your needs. You can view all the offerings and get started on the Hume website.
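To put the tier figures quoted above on a common footing, the short Python sketch below works out the implied cost per 1,000 characters and the implied characters per minute of audio. The numbers come straight from the review; the `TIERS` table and the helper function names are made up for this example, and the per-minute figure is only the reviewer's approximation, not a measured rate.

```python
# Tier figures as quoted in the review; names are illustrative only.
TIERS = {
    "Starter": {"usd_per_month": 3, "characters": 30_000, "approx_minutes": 30},
    "Business": {"usd_per_month": 900, "characters": 10_000_000, "approx_minutes": 10_000},
}


def usd_per_1k_chars(tier: dict) -> float:
    """Monthly price divided by the character allowance, per 1,000 characters."""
    return tier["usd_per_month"] / tier["characters"] * 1_000


def chars_per_minute(tier: dict) -> float:
    """Character allowance divided by the approximate minutes of audio."""
    return tier["characters"] / tier["approx_minutes"]


for name, tier in TIERS.items():
    print(f"{name}: ${usd_per_1k_chars(tier):.2f} per 1,000 characters, "
          f"about {chars_per_minute(tier):.0f} characters per minute")
```

On these figures the Starter tier works out to about $0.10 per 1,000 characters and the Business tier to about $0.09, so the larger plan is only slightly cheaper per character. Both tiers, like the free tier's 10,000 characters for around 10 minutes, imply roughly 1,000 characters per minute of speech.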
Hume AI launches Octave, an innovative text-to-speech system powered by a large language model, capable of generating contextually aware and emotionally nuanced speech for various applications.
Hume AI, a New York City-based startup, has unveiled Octave, a groundbreaking text-to-speech (TTS) system that promises to revolutionize AI-driven voice synthesis. Octave, short for "Omni-capable text and voice engine," leverages large language model (LLM) technology to generate contextually aware and emotionally nuanced speech [1][2].
Octave distinguishes itself from traditional TTS systems by its ability to comprehend the context of the text and add appropriate emotional undertones. The AI tool can adjust tone, rhythm, and cadence accordingly, resulting in more lifelike and engaging speech [1].
One of Octave's standout features is its Voice Design capability. Users can create unique AI voices by providing descriptive prompts specifying characteristics such as accent, age, gender, and emotional tone. For instance, prompting Octave with "a dramatic medieval knight" will generate a voice embodying that persona [1][2].
Octave's capabilities go beyond basic voice generation. It can interpret character traits and style from a script alone, adjusting vocal inflections to match implied emotions. A sarcastic remark will be spoken sarcastically, a panicked sentence will sound urgent, and a whispered secret will be hushed – all without needing explicit direction [2][4].
Unlike traditional TTS systems that rely on limited speech datasets, Octave is built on an LLM trained on tens of trillions of language tokens. This extensive training allows the model to infer emotional context and follow detailed instructions, creating voices that match specific character descriptions and attributes [2][4].
In a blind comparison study conducted by Hume AI, 180 human raters favored Octave's outputs over those from ElevenLabs in terms of audio quality (71.6%), naturalness (51.7%), and alignment with desired voice descriptions (57.7%) across 120 diverse prompts [1][2][3].
Octave's advanced capabilities have broad implications across various industries. Content creators can utilize Octave to generate dynamic voiceovers for audiobooks, podcasts, and videos. In gaming, developers can craft immersive character dialogues that adapt to in-game contexts and player interactions [1][2].
Octave is available through Hume's website and API. The company offers a subscription-based pricing model with tiers ranging from a free option to Creator, Creator Pro, and Enterprise plans. Hume emphasizes that its Octave TTS pricing is around half the cost of competing AI voice creation startup ElevenLabs [2][4][5].
While Octave represents a significant technological advancement, it also raises important ethical considerations. The ability to generate highly realistic and emotionally resonant speech necessitates responsible use to prevent potential misuse, such as deepfake audio or deceptive impersonations [1].
As AI continues to evolve, innovations like Octave highlight the potential for technology to bridge the gap between human expression and machine-generated communication, setting a new standard in text-to-speech technology [1][2][3].
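Reference
[2] Hume launches new text-to-speech model Octave that generates custom AI voices with adjustable emotions
[3] Analytics India Magazine | Hume's Octave Claims to Outperform ElevenLabs in Capturing Human-Like Emotions in AI Voices
[4] Hume launches text-to-speech model Octave that generates emotive, adjustable AI voices on-demand based on your prompts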
© 2025 TheOutpost.AI All rights reserved