2 Sources
[1]
Microsoft's VibeVoice uses AI to create 90-minute podcasts with multiple speakers
TL;DR: Microsoft's open-source VibeVoice AI generates up to 90 minutes of multi-speaker, high-fidelity conversational audio using advanced text-to-speech technology. It leverages a Large Language Model and diffusion framework to maintain speaker consistency and natural dialogue flow, making it ideal for podcasts and long-form audio content. Microsoft's new open-source text-to-voice generative AI tool, VibeVoice, is an interesting one, as it can generate audio of up to 90 minutes in length with four distinct speakers. Naturally, with a script, VibeVoice becomes a viable tool for creating an audio podcast or other "expressive, long-form, multi-speaker conversational audio." With there already being quite a few AI-powered Text-to-Speech (TTS) systems and tools, what separates VibeVoice from the pack is its ability to maintain and preserve audio fidelity, speaker consistency, and "natural turn-taking" over an extended period. "VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details," the official description reads. VibeVoice offers a live demo for you to check out, along with the option to download it. As a pure Text-to-Speech (TTS) tool, VibeVoice requires a script to work, which you'll need to whip up yourself or use another AI tool like ChatGPT to generate. VibeVoice is available in multiple versions: a compact 1.5 billion-parameter model and a more complex 7 billion-parameter model. There's also a 0.5 billion parameter model on the way, designed for real-time audio generation. For those with a modern GPU, the 1.5 billion-parameter version requires approximately 7GB of VRAM, while the larger 7 billion-parameter model requires around 18GB. As for the quality of VibeVoice, the voices and conversation flow, although impressive, still sound very much like those of an AI. 
For more on VibeVoice, check out its GitHub repository and Hugging Face page.
[2]
Microsoft Unveils VibeVoice for Longer Conversational AI Audio | PYMNTS.com
But there are notable differences. Microsoft's text-to-speech model can generate four voices and up to 90 minutes of podcast-quality speech. NotebookLM can do two voices. Additionally, VibeVoice reads and organizes text while NotebookLM ingests documents and turns them into two-person podcasts. Users can also query and get document summaries, according to tech firm Hugging Face. That means VibeVoice doesn't try to understand the text but rather performs it audibly, ostensibly to replace a recording studio. VibeVoice is the latest offering in voice AI technology, which has been attracting venture capital funding. In 2024, voice AI startups raised $2.1 billion, up eightfold from the prior year, according to market research firm CB Insights. There's rising interest in voice shopping: A PYMNTS Intelligence report shows that 30.4% of Gen Z consumers already shop by voice every week, followed by millennials. For all ages, the average is 17.9% of consumers using voice to shop. VibeVoice runs on 1.5 billion parameters, relatively small for a model capable of sustaining dialogue across multiple speakers. It was trained using Alibaba's open-source Qwen2.5, a large language model that helps orchestrate natural turn-taking and contextually aware speech patterns during dialogues. Microsoft claims this means VibeVoice can produce fluid conversations among four voices and yet maintain each voice's distinct characteristics, even in longer conversations.
Potential research applications of VibeVoice include the following: Recognizing the risks of deepfakes, Microsoft said VibeVoice's safeguards include ensuring every audio file includes both a disclaimer -- such as "This segment was generated by AI" -- and a hidden digital watermark. It bars impersonation, disinformation and live deepfake uses such as real-time voice conversion in calls. It supports only English and Chinese speech for now. The model is available for research, not commercial deployment.
Microsoft unveils VibeVoice, an open-source AI tool capable of creating 90-minute podcasts with multiple speakers, showcasing advancements in long-form conversational audio generation.
Microsoft has unveiled VibeVoice, an open-source text-to-voice generative AI tool that can generate up to 90 minutes of high-fidelity conversational audio featuring multiple speakers, a significant advance in AI-powered audio generation [1].
VibeVoice stands out from existing Text-to-Speech (TTS) systems due to its ability to maintain audio fidelity, speaker consistency, and natural turn-taking over extended periods. The system employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to comprehend textual context and dialogue flow, while a diffusion head generates high-fidelity acoustic details [1].
The tool is available in multiple versions to cater to different computational needs: a compact 1.5 billion-parameter model that requires approximately 7GB of VRAM, a larger 7 billion-parameter model that requires around 18GB, and a forthcoming 0.5 billion-parameter model designed for real-time audio generation [1].
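As a rough illustration of the hardware trade-off, the sketch below picks the largest model variant that fits a given GPU, using only the approximate VRAM figures reported in source [1] (about 7GB for the 1.5 billion-parameter model and about 18GB for the 7 billion-parameter model). The variant labels, the `VRAM_REQUIREMENTS_GB` table, and the `pick_variant` helper are hypothetical names made up for this example; they are not part of VibeVoice or any Microsoft tooling.

```python
from typing import Optional

# Approximate VRAM needs in GB, as reported in source [1].
# The keys are illustrative labels, not official model identifiers.
VRAM_REQUIREMENTS_GB = {
    "1.5B": 7.0,
    "7B": 18.0,
}


def pick_variant(available_vram_gb: float) -> Optional[str]:
    """Return the largest variant whose reported VRAM need fits, or None."""
    fitting = [
        (need_gb, name)
        for name, need_gb in VRAM_REQUIREMENTS_GB.items()
        if need_gb <= available_vram_gb
    ]
    if not fitting:
        return None
    # Tuples compare by VRAM need first, so max() selects the largest model.
    return max(fitting)[1]


if __name__ == "__main__":
    for vram_gb in (4, 8, 24):
        print(f"{vram_gb} GB available -> {pick_variant(vram_gb)}")
```

Because the reported figures are approximate, a real setup would likely want some headroom above these thresholds.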
VibeVoice's primary application lies in creating expressive, long-form, multi-speaker conversational audio content. It's particularly suited for generating podcast-like content, offering four distinct voices that can maintain their characteristics throughout lengthy dialogues [2].
The tool requires a script to function, which can be created manually or generated using other AI tools like ChatGPT. This flexibility opens up numerous possibilities for content creators and researchers in the field of conversational AI [1].
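Recognizing the potential risks associated with deepfake technology, Microsoft has implemented several safeguards in VibeVoice: every audio file includes both a disclaimer, such as "This segment was generated by AI," and a hidden digital watermark, and the tool bars impersonation, disinformation and live deepfake uses such as real-time voice conversion in calls. It supports only English and Chinese speech for now [2].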
VibeVoice enters a rapidly growing market for voice AI technology. In 2024, voice AI startups raised $2.1 billion, an eightfold increase from the previous year, according to market research firm CB Insights. This surge in funding reflects the rising interest in voice-based technologies, particularly in areas like voice shopping [2].
A PYMNTS Intelligence report indicates that 30.4% of Gen Z consumers already shop by voice weekly, followed by millennials. Across all age groups, an average of 17.9% of consumers use voice for shopping [2].
While VibeVoice is currently available only for research purposes and not for commercial deployment, its introduction signals potential shifts in content creation, entertainment, and various industries relying on audio communication [2].
As AI-generated audio content becomes more sophisticated, it raises important questions about the future of human-created content and the potential impact on industries such as podcasting, audiobooks, and voice acting. The development of tools like VibeVoice underscores the need for ongoing discussions about the ethical use of AI in content creation and the importance of transparency in AI-generated media.
Summarized by
Navi