2 Sources
[1]
You Can Now Try Out Gemini 2.5's Native Audio Dialog Generation
TTS in Gemini 2.5 Flash allows multi-speaker dialogue generation. Google introduced new audio generation capabilities with the Gemini 2.5 models at Google I/O 2025. The Mountain View-based tech giant is now letting developers and individuals test these features on its platform. The two new capabilities are native audio dialog and controllable text-to-speech (TTS) with Gemini 2.5 Flash preview. The former natively generates human-like audio in response to user prompts; the latter converts any script into conversational speech. These features are currently not available to developers via application programming interfaces (APIs). In a blog post, the tech giant detailed the features of these two audio generation modes, highlighting how developers can use them to build new experiences for people. Currently, native audio dialog can be tried out in Google AI Studio's stream tab, whereas the TTS feature can be tested in the generate media tab within AI Studio. Native audio dialog with Gemini 2.5 Flash preview is designed for real-time conversations between a human user and the AI. The user can either type a prompt or speak it, and the AI responds verbally. This process generates audio directly, instead of first generating text and then converting it into speech, which has several advantages. It supports affective dialog, meaning Gemini 2.5 Flash can recognise the emotion in the user's tone of voice. It can tell when the user sounds scared, angry, or surprised and respond accordingly. Apart from this, the audio generation feature can express emotions when speaking, adopt different accents and linguistic styles, access tools such as Google Search, and handle more than 24 languages.
The controllable TTS feature offers multi-speaker dialogue generation, can produce emotions and accents while narrating the script, can control delivery speed and emphasise pronunciation, and supports the same 24 languages as well as language mixing. Google says these capabilities were assessed for potential risks across the development process. The company used both internal mechanisms and red teaming to find and fix any vulnerabilities. The company also highlighted that all audio outputs from these models are embedded with SynthID, its watermarking technology.
[2]
Google expands Gemini 2.5 with native voice and TTS tools
At its I/O event, Google unveiled Gemini 2.5, an AI model with new audio dialogue and generation capabilities. These enhancements aim to deliver seamless voice interactions across various products and languages globally. Google has integrated Gemini 2.5 into applications like NotebookLM's Audio Overviews and Project Astra. The model prioritizes real-time audio conversations, enabling AI to interpret and produce speech with natural tone, style, and contextual awareness. Gemini 2.5 offers advanced control over audio generation, allowing users to tailor speech output with precision. Google provides two Gemini 2.5 configurations for audio development. These configurations facilitate audio creation for applications such as podcasts, video games, and public announcements. Google conducted comprehensive risk evaluations during the development of Gemini 2.5's audio features. Safety measures were refined through internal and external testing, including red teaming. All AI-generated audio includes SynthID, Google's watermarking technology, to clearly identify AI-produced content. Google enables developers to use Gemini 2.5's audio capabilities via the Gemini API, accessible through Google AI Studio and Vertex AI environments.
Google introduces native audio dialog and controllable text-to-speech features in Gemini 2.5, offering developers new tools for creating immersive AI-powered audio experiences.
Google has introduced new audio generation features in its Gemini 2.5 models, showcased at the Google I/O 2025 event. These capabilities, native audio dialog and controllable text-to-speech (TTS), are now available for developers and individuals to test 1.
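The native audio dialog feature in Gemini 2.5 Flash preview enables real-time conversations between users and AI. It generates audio responses directly rather than producing text first and converting it to speech. Key features include affective dialog, in which the model recognizes emotions such as fear, anger, or surprise in the user's tone of voice and responds accordingly; expressive speech with different accents and linguistic styles; support for more than 24 languages; and access to tools such as Google Search 1.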
Source: NDTV Gadgets 360
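Gemini 2.5's controllable TTS feature gives fine-grained control over audio output: it supports multi-speaker dialogue generation, emotions and accents during narration, control over delivery speed and pronunciation emphasis, and the same 24 languages, including language mixing 1.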
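Google has prioritized safety and ethical considerations in developing these audio features: the capabilities were assessed for potential risks throughout development, vulnerabilities were sought out through internal testing and red teaming, and all audio output is embedded with SynthID, Google's watermarking technology 1.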
The new audio capabilities of Gemini 2.5 have been integrated into Google products, including NotebookLM's Audio Overviews and Project Astra 2.
The two sources differ on API availability. According to the first, the features can currently be tested in Google AI Studio but are not yet accessible via APIs 1. The second reports that Google makes Gemini 2.5's audio capabilities available to developers through the Gemini API, accessible via Google AI Studio and Vertex AI environments 2.
These capabilities are aimed at audio creation across domains such as podcasting, gaming, and public announcements 2.
Summarized by
Navi