2 Sources
[1]
You Can Now Try Out Gemini 2.5's Native Audio Dialog Generation
TTS in Gemini 2.5 Flash allows multi-speaker dialogue generation.
Google introduced new audio generation capabilities with the Gemini 2.5 models at Google I/O 2025. The Mountain View-based tech giant is now letting developers and individuals test these features on its platform. The two new capabilities are native audio dialog and controllable text-to-speech (TTS) with Gemini 2.5 Flash preview. The former natively generates human-like audio in response to user prompts, while the latter can convert any script into conversational speech. These features are currently not available to developers via application programming interfaces (APIs).
In a blog post, the tech giant detailed the features of these two audio generation modes, highlighting how developers can use them to build new experiences for people. Currently, native audio dialog can be tried out in Google AI Studio's stream tab, whereas the TTS feature can be tested in the generate media tab within AI Studio.
Native audio dialog with Gemini 2.5 Flash preview is designed for real-time conversations between a human user and the AI. The user can either type a prompt or speak it, and the AI responds verbally. This process generates audio directly, instead of first generating text and then converting it into speech. This approach has several advantages. It supports affective dialog, which means that Gemini 2.5 Flash can recognise the emotion in the user's tone of voice when it responds. It can understand when the user sounds scared, angry, or surprised and respond accordingly. Apart from this, the audio generation feature can express emotions when speaking, adopt different accents and linguistic styles, access tools such as Google Search, and support more than 24 languages.
The controllable TTS feature offers multi-speaker dialogue generation, can produce emotions and accents while narrating a script, lets users control delivery speed and emphasise pronunciation, and supports the same 24 languages and language mixing. Google says these capabilities were assessed for potential risks throughout the development process. The company used both internal mechanisms and red teaming to find and fix any vulnerabilities. The company also highlighted that all audio outputs from these models are embedded with SynthID, its watermarking technology.
[2]
Google expands Gemini 2.5 with native voice and TTS tools
At its I/O event, Google unveiled Gemini 2.5, an AI model with cutting-edge audio dialogue and generation capabilities. These enhancements aim to deliver seamless voice interactions across various products and languages globally. Google has integrated Gemini 2.5 into applications like NotebookLM's Audio Overviews and Project Astra. The model prioritizes real-time audio conversations, enabling AI to interpret and produce speech with natural tone, style, and contextual awareness. Gemini 2.5 offers advanced control over audio generation, allowing users to tailor speech output with precision. Google provides two Gemini 2.5 configurations for audio development. These configurations facilitate audio creation for applications such as podcasts, video games, and public announcements. Google conducted comprehensive risk evaluations during the development of Gemini 2.5's audio features. Safety measures were refined through internal and external testing, including red teaming. All AI-generated audio includes SynthID, Google's watermarking technology, to clearly identify AI-produced content. Google enables developers to utilize Gemini 2.5's audio capabilities via the Gemini API, accessible through Google AI Studio and Vertex AI environments.
Google introduces native audio dialog and controllable text-to-speech features in Gemini 2.5, offering developers new tools for creating immersive AI-powered audio experiences.
Google has introduced new audio generation features in its Gemini 2.5 models, announced at Google I/O 2025. Developers and individuals can now test two capabilities: native audio dialog and controllable text-to-speech (TTS) 1.
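The native audio dialog feature in Gemini 2.5 Flash preview enables real-time conversations between users and AI. It generates audio responses directly, rather than first generating text and then converting it to speech. Key features include affective dialog (recognizing emotion in the user's tone of voice and responding accordingly), expressive speech in different accents and linguistic styles, access to tools such as Google Search, and support for more than 24 languages 1.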
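Gemini 2.5's controllable TTS feature converts a script into conversational speech and offers control over the output: multi-speaker dialogue generation, emotions and accents in narration, control over delivery speed and pronunciation emphasis, and support for the same languages, including language mixing 1.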
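Google says it assessed these audio features for potential risks throughout development, using both internal mechanisms and red teaming to find and fix vulnerabilities. All audio output from the models is embedded with SynthID, Google's watermarking technology, to identify it as AI-generated 1 2.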
The new audio capabilities of Gemini 2.5 have been integrated into Google products such as NotebookLM's Audio Overviews and Project Astra 2.
The two sources differ on availability. One reports that the features can currently be tested in Google AI Studio but are not yet accessible via APIs 1. The other reports that Google enables developers to use Gemini 2.5's audio capabilities via the Gemini API, accessible through Google AI Studio and Vertex AI environments 2.
These capabilities are aimed at audio creation for applications such as podcasts, video games, and public announcements 2.
Summarized by
Navi