Curated by THEOUTPOST
On Wed, 23 Apr, 12:05 AM UTC
2 Sources
[1]
Two undergrads built an AI speech model to rival NotebookLM | TechCrunch
A pair of undergrads, neither with extensive AI expertise, say that they've created an openly available AI model that can generate podcast-style clips similar to Google's NotebookLM.

The market for synthetic speech tools is vast and growing. ElevenLabs is one of the largest players, but there's no shortage of challengers (see PlayAI, Sesame, and so on). Investors believe these tools have immense potential: according to PitchBook, startups developing voice AI tech raised over $398 million in VC funding last year.

Toby Kim, one of the Korea-based co-founders of Nari Labs, the group behind the newly released model, said that he and his fellow co-founder started learning about speech AI three months ago. Inspired by NotebookLM, they wanted to create a model that offered more control over generated voices and "freedom in the script."

Kim says they used Google's TPU Research Cloud program, which provides researchers with free access to the company's TPU AI chips, to train Nari's model, Dia. Weighing in at 1.6 billion parameters, Dia can generate dialogue from a script, letting users customize speakers' tones and insert disfluencies, coughs, laughs, and other nonverbal cues. (Parameters are the internal variables a model uses to make predictions; generally, models with more parameters perform better.)

Available from the AI dev platform Hugging Face and GitHub, Dia can run on most modern PCs with at least 10GB of VRAM. It generates a random voice unless prompted with a description of an intended style, but it can also clone a person's voice.

In TechCrunch's brief testing of Dia through Nari's web demo, the model worked quite well, readily generating two-way chats about any subject. The quality of the voices seems competitive with other tools out there, and the voice cloning function is among the easiest this reporter has tried.

Like many voice generators, however, Dia offers little in the way of safeguards. It'd be trivially easy to craft disinformation or a scammy recording. On Dia's project pages, Nari discourages abuse of the model to impersonate, deceive, or otherwise engage in illicit campaigns, but the group says it "isn't responsible" for misuse.

Nari also hasn't disclosed which data it scraped to train Dia. It's possible Dia was developed using copyrighted content; a commenter on Hacker News notes that one sample sounds like the hosts of NPR's "Planet Money" podcast. Training models on copyrighted content is a widespread but legally dubious practice. Some AI companies claim that fair use shields them from liability, while rights holders assert that fair use doesn't apply to training.

In any event, Kim says Nari's plan is to build a synthetic voice platform with a "social aspect" on top of Dia and larger, future models. Nari also intends to release a technical report for Dia and to expand the model's support to languages beyond English.
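For readers who want to try the model locally, here is a minimal sketch of script-driven generation. It follows the usage pattern in Nari's published examples (the Dia.from_pretrained loader, [S1]/[S2] speaker tags, and 44.1kHz output); exact module and method names should be checked against the current release.

```python
# Minimal sketch of local dialogue generation with Dia, following the
# usage pattern published in Nari Labs' repository (nari-labs/dia).
# Module and method names mirror the project's examples and may change.
import soundfile as sf
from dia.model import Dia

# Download the 1.6B-parameter checkpoint from Hugging Face.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] mark speaker turns; parenthesized cues like (laughs)
# are rendered as actual nonverbal audio rather than read aloud.
script = (
    "[S1] Have you tried the new open speech model? "
    "[S2] I have. It even laughs on cue. (laughs) "
    "[S1] (clears throat) Let's hear it, then."
)

audio = model.generate(script)          # returns a raw audio array
sf.write("dialogue.wav", audio, 44100)  # 44.1kHz, per the project's examples
```

Note that without a fixed seed or an audio prompt, each run produces different voices, as both articles point out.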
[2]
A new, open source text-to-speech model called Dia has arrived to challenge ElevenLabs, OpenAI and more
A two-person startup by the name of Nari Labs has introduced Dia, a 1.6 billion parameter text-to-speech (TTS) model designed to produce naturalistic dialogue directly from text prompts. One of its creators claims it surpasses competing proprietary offerings from the likes of ElevenLabs and Google's hit NotebookLM AI podcast generation product. It could also threaten uptake of OpenAI's recent gpt-4o-mini-tts.

"Dia rivals NotebookLM's podcast feature while surpassing ElevenLabs Studio and Sesame's open model in quality," said Toby Kim, one of the co-creators of Nari and Dia, in a post from his account on the social network X. In a separate post, Kim noted that the model was built with "zero funding," adding across a thread: "...we were not AI experts from the beginning. It all started when we fell in love with NotebookLM's podcast feature when it was released last year. We wanted more -- more control over the voices, more freedom in the script. We tried every TTS API on the market. None of them sounded like real human conversation."

Kim further credited Google for giving him and his collaborator access to the company's Tensor Processing Unit (TPU) chips for training Dia through Google's TPU Research Cloud.

Dia's code and weights -- the internal model connection set -- are now available for download and local deployment by anyone from Hugging Face or GitHub. Individual users can try generating speech with it on a Hugging Face Space.

Advanced controls and more customizable features

Dia supports nuanced features like emotional tone, speaker tagging, and nonverbal audio cues -- all from plain text. Users can mark speaker turns with tags like [S1] and [S2], and include cues like (laughs), (coughs), or (clears throat) to enrich the resulting dialogue with nonverbal behaviors. These tags are correctly interpreted by Dia during generation -- something not reliably supported by other available models, according to the company's examples page.

The model is currently English-only and not tied to any single speaker's voice, producing different voices per run unless users fix the generation seed or provide an audio prompt. Audio conditioning, or voice cloning, lets users guide speech tone and voice likeness by uploading a sample clip. Nari Labs offers example code to facilitate this process and a Gradio-based demo so users can try it without setup.

Comparison with ElevenLabs and Sesame

Nari offers a host of example audio files generated by Dia on its Notion website, comparing it to leading text-to-speech rivals, specifically ElevenLabs Studio and Sesame CSM-1B, the latter a new speech model from Oculus VR headset co-creator Brendan Iribe that went somewhat viral on X earlier this year.

Side-by-side examples shared by Nari Labs show how Dia outperforms the competition in several areas. In standard dialogue scenarios, Dia handles both natural timing and nonverbal expressions better: in a script ending with (laughs), Dia interprets and delivers actual laughter, whereas ElevenLabs and Sesame output textual substitutions like "haha." In multi-turn conversations with emotional range, Dia demonstrates smoother transitions and tone shifts. One test included a dramatic, emotionally charged emergency scene.
Dia rendered the urgency and speaker stress effectively, while competing models often flattened delivery or lost pacing. Dia uniquely handles nonverbal-only scripts, such as a humorous exchange involving coughs, sniffs, and laughs; competing models failed to recognize these tags or skipped them entirely. Even with rhythmically complex content like rap lyrics, Dia generates fluid, performance-style speech that maintains tempo, in contrast with the more monotone or disjointed outputs from ElevenLabs and Sesame's 1B model.

Using audio prompts, Dia can extend or continue a speaker's voice style into new lines. An example using a conversational clip as a seed showed how Dia carried vocal traits from the sample through the rest of the scripted dialogue -- a feature that isn't robustly supported in other models. In one set of tests, Nari Labs noted that Sesame's best website demo likely used an internal 8B version of the model rather than the public 1B checkpoint, resulting in a gap between advertised and actual performance.

The model runs on PyTorch 2.0+ and CUDA 12.6 and requires about 10GB of VRAM. Inference on enterprise-grade GPUs like the NVIDIA A4000 delivers roughly 40 tokens per second. While the current version only runs on GPU, Nari plans to offer CPU support and a quantized release to improve accessibility. The startup offers both a Python library and a CLI tool to further streamline deployment.

Dia's flexibility opens use cases from content creation to assistive technologies and synthetic voiceovers. Nari Labs is also developing a consumer version of Dia aimed at casual users looking to remix or share generated conversations. Interested users can sign up via email to a waitlist for early access.

Fully open source

The model is distributed under a fully open source Apache 2.0 license, which means it can be used for commercial purposes -- something that will obviously appeal to enterprises and indie app developers. Nari Labs explicitly prohibits usage that includes impersonating individuals, spreading misinformation, or engaging in illegal activities. The team encourages responsible experimentation and has taken a stance against unethical deployment.

Dia's development credits support from the Google TPU Research Cloud, Hugging Face's ZeroGPU grant program, and prior work on SoundStorm, Parakeet, and Descript Audio Codec. Nari Labs itself comprises just two engineers -- one full-time and one part-time -- but the team actively invites community contributions through its Discord server and GitHub. With a clear focus on expressive quality, reproducibility, and open access, Dia adds a distinctive new voice to the landscape of generative speech models.
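As a rough illustration of the audio conditioning workflow described above, here is a hypothetical sketch modeled on the voice-cloning example code Nari Labs provides. The audio_prompt_path argument and the convention of prepending the reference clip's transcript are assumptions drawn from that example and may differ between releases.

```python
# Hypothetical sketch of Dia's audio conditioning (voice cloning),
# modeled on the example code Nari Labs ships with the repository.
# The audio_prompt_path parameter and transcript-prepending convention
# are assumptions based on that example, not confirmed API.
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Transcript of the reference clip, prepended so the model can align
# the sample's voice with the text it is conditioned on.
reference_transcript = "[S1] This is a short sample of my speaking voice."
new_lines = "[S1] And now Dia continues in that same voice. (laughs)"

audio = model.generate(
    reference_transcript + " " + new_lines,
    audio_prompt_path="reference.wav",  # the uploaded sample clip
)
sf.write("cloned.wav", audio, 44100)
```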
Two undergraduate students with limited AI expertise have developed Dia, an open-source AI speech model that challenges established players like Google's NotebookLM and ElevenLabs.
In a surprising turn of events, two undergraduate students with limited AI expertise have developed an open-source AI speech model that rivals industry giants. Toby Kim and his co-founder, operating under the name Nari Labs, have created Dia, a 1.6 billion parameter text-to-speech (TTS) model designed to produce naturalistic dialogue from text prompts [1][2].
Dia offers advanced features that set it apart from existing models:
- Speaker tags such as [S1] and [S2] for scripting multi-speaker dialogue
- Nonverbal cues like (laughs), (coughs), and (clears throat), rendered as actual audio
- Control over emotional tone and delivery from plain text
- Voice cloning via uploaded audio prompts, with voices varying per run unless a seed or sample clip is fixed
The model runs on PyTorch 2.0+ and CUDA 12.6, requiring about 10GB of VRAM. It can generate approximately 40 tokens per second on enterprise-grade GPUs like the NVIDIA A4000 [2].
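Given that roughly 10GB VRAM floor, a quick preflight check can save a failed run. This is a generic PyTorch snippet, not part of Nari's tooling:

```python
# Generic PyTorch preflight check (not part of Nari's tooling):
# verify a CUDA GPU is present and has roughly the ~10GB of VRAM
# the current GPU-only release of Dia reportedly requires.
import torch

if not torch.cuda.is_available():
    print("No CUDA GPU found; Dia's current release is GPU-only.")
else:
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    if vram_gb < 10:
        print("Below the ~10GB Dia reportedly needs; expect out-of-memory errors.")
```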
The creators of Dia leveraged Google's TPU Research Cloud program, which provided free access to the company's TPU AI chips for training. This resource was crucial in enabling the undergraduates to compete with well-funded companies in the AI space [1].
Nari Labs claims that Dia outperforms competing proprietary offerings from ElevenLabs and Google's NotebookLM, and potentially even OpenAI's recent gpt-4o-mini-tts [2]. The company provides side-by-side comparisons on its website, demonstrating Dia's superior handling of:
- Natural timing and nonverbal expressions, rendering cues like (laughs) as actual laughter rather than the text "haha"
- Smoother transitions and tone shifts in multi-turn, emotionally charged dialogue
- Nonverbal-only scripts made up of coughs, sniffs, and laughs, which rival models skip or misread
- Rhythmically complex content such as rap lyrics, delivered at a consistent tempo
Dia is fully open-source, distributed under the Apache 2.0 license, allowing for commercial use. The model is available for download from Hugging Face and GitHub, and can run on most modern PCs with at least 10GB of VRAM [1][2].
The flexibility of Dia opens up various use cases, including:
- Content creation, such as podcast-style dialogue
- Assistive technologies
- Synthetic voiceovers
Nari Labs is developing a consumer version of Dia for casual users interested in remixing or sharing generated conversations. They also plan to release a technical report and expand language support beyond English [1][2].
While Dia offers impressive capabilities, it also raises concerns about potential misuse. The model currently lacks robust safeguards against the creation of disinformation or scam recordings. Nari Labs discourages abuse but states they are not responsible for misuse [1].
Additionally, questions arise about the data used to train Dia, as it may include copyrighted content. This issue reflects a broader debate in the AI industry about the legality and ethics of training models on copyrighted materials [1].
As Dia enters the market, it illustrates both the democratization of AI technology and the need to weigh the implications of synthetic speech and deploy it responsibly in this rapidly evolving field.
Deepgram launches Aura-2, a new text-to-speech AI model designed for enterprise use, outperforming competitors in blind tests and offering cost-effective, high-quality voice solutions for business applications.
2 Sources
Google's NotebookLM, an AI-powered study tool, has gained viral attention for its Audio Overview feature, which creates engaging AI-generated podcasts from various content sources.
5 Sources
OpenAI introduces new AI models for speech-to-text and text-to-speech, offering improved accuracy, customization, and potential for building AI agents with voice capabilities.
7 Sources
Hume AI launches Octave, an innovative text-to-speech system powered by a large language model, capable of generating contextually aware and emotionally nuanced speech for various applications.
5 Sources
Sesame, the startup behind the viral virtual assistant Maya, has released its base AI model CSM-1B for public use. While this move promotes innovation, it also raises ethical concerns about potential misuse of voice cloning technology.
2 Sources