2 Sources
[1]
Google's Gemini 3.1 Flash TTS offers unparalleled control over AI voices - SiliconANGLE
Google's Gemini 3.1 Flash TTS offers unparalleled control over AI voices Google LLC's DeepMind artificial intelligence unit has rolled out a new text-to-speech model called Gemini 3.1 Flash TTS. Unlike its earlier, robotic predecessors, it enables users to direct the vocal style, delivery and pace of chatbot responses through text-based commands, the company said in a blog post. A video posted on X shows that Gemini 3.1 Flash TTS provides advanced options for controlling the voice projected by the model, with controls that can adapt its inflection and tone. Options include "enthusiastic," "positive surprise" and "informative." In addition, the model also allows users to select different regional accents of various major languages. English has a myriad of options to choose from, including American "Valley" and "Southern" accents, plus numerous British variants, including "Brixton" and "RP." There are other accents too, such as "Transatlantic." Another feature is Gemini 3.1 Flash TTS's director-level controls, which allow users to adjust the model's speaking style and pace. There are also format templates that users can choose from, including podcast conversation, audiobook narrator, language tutor, voice assistant, wellness guide, news broadcaster and support agent styles. Google said users will be able to "set the stage" by defining the environment and providing specific dialogue instructions, and that they'll be able to export these settings as application programming interface code. "This world-building context helps characters remain "in-character" and react to one another naturally across multiple turns," the company said in a blog post. "Once the performance is perfected, these exact parameters can be exported as Gemini API code to ensure consistent, recognizable voices across various projects and platforms." 
Google said the goal of Gemini 3.1 Flash TTS is to offer more natural-sounding speech experiences, and it's doing this in a huge variety of more than 70 languages, including Japanese, Hindi and German. The model also features SynthID watermarks on all of its outputs, so its content is easy to detect. On the Artificial Analysis TTS leaderboard, a benchmark that captures thousands of blind human preferences, Gemini 3.1 Flash TTS ranked second overall with a score of 1211, surpassing many other popular text-to-speech models.
[2]
Google Gemini 3.1 Flash TTS AI model is here: Capabilities, availability and other details
Google claims that Gemini 3.1 Flash TTS is its most natural and expressive model yet. Google has introduced a new text-to-speech AI model dubbed Gemini 3.1 Flash TTS. According to the tech giant, the new model delivers improved controllability, expressivity and quality. On the Artificial Analysis TTS leaderboard, a benchmark that captures thousands of blind human preferences, the model achieved an Elo score of 1,211. Google says that Artificial Analysis has also positioned Gemini 3.1 Flash TTS within its 'most attractive quadrant' as the model balances performance with low cost. One of the biggest upgrades in Gemini 3.1 Flash TTS is improved speech controllability. Users can guide how the AI speaks using natural language instructions. The model also introduces audio tags, which allow users to adjust vocal delivery more precisely. You can control speaking speed, pace and delivery. 'By embedding natural language commands directly into the text input, you can steer AI-speech output with improved levels of granularity,' Google said. Another key feature is support for multi-speaker dialogue. Developers can create different characters with unique audio profiles. Gemini 3.1 Flash TTS also supports more than 70 languages. 'Gemini 3.1 Flash TTS delivers high-fidelity speech and more precise control across more than 70 languages. These core optimisations bring advanced style, pacing and accent control to major markets,' the tech giant said. Note that all audio generated by Gemini 3.1 Flash TTS includes a SynthID watermark. This invisible watermark is embedded in the audio and helps detect AI-generated content. Developers can access Gemini 3.1 Flash TTS in preview through the Gemini API and Google AI Studio.
Enterprise users can use the model in preview through Vertex AI. Workspace users can access the new model via Google Vids.
Google DeepMind launched Gemini 3.1 Flash TTS, a text-to-speech model that lets users direct vocal style, delivery, and pace through natural language commands. Scoring 1,211 on the Artificial Analysis TTS leaderboard, the model supports over 70 languages with regional accents and includes SynthID watermarks for detecting AI-generated content.
Google DeepMind has introduced Gemini 3.1 Flash TTS, a text-to-speech model that marks a significant shift from robotic predecessors by enabling unparalleled control over AI voices [1]. The model allows users to guide AI speech with natural language instructions, directing vocal style, delivery, and pace through simple text-based commands. Google claims this is its most natural and expressive model yet, designed to deliver natural-sounding speech experiences across a wide range of applications [2].
On the Artificial Analysis TTS leaderboard, which captures thousands of blind human preferences, Flash TTS achieved an Elo score of 1,211, ranking second overall and surpassing many popular text-to-speech models [1]. According to Google, Artificial Analysis positioned the model within its 'most attractive quadrant' because it balances performance with low cost [2].

One of the standout features is the model's director-level controls for speaking style, which allow users to adjust inflection and tone with options including "enthusiastic," "positive surprise," and "informative" [1]. The model introduces audio tags that enable users to adjust vocal delivery more precisely, controlling speaking speed and pace with improved levels of granularity [2].

Flash TTS supports different regional accents across various major languages, with English offering options like American "Valley" and "Southern" accents, plus British variants including "Brixton" and "RP," as well as "Transatlantic" [1]. The model delivers high-fidelity speech across more than 70 languages, including Japanese, Hindi, and German, bringing advanced style, pacing, and accent control to major markets [2].

The text-to-speech model includes format templates that users can choose from, such as podcast conversation, audiobook narrator, language tutor, voice assistant, wellness guide, news broadcaster, and support agent styles [1]. Users can "set the stage" by defining the environment and providing specific dialogue instructions, with the ability to export these settings as application programming interface code.

Another key capability is support for multi-speaker dialogues, allowing developers to create different characters with unique audio profiles [2]. According to Google, this world-building context helps characters remain "in-character" and react to one another naturally across multiple turns. Once perfected, these parameters can be exported as Gemini API code to ensure consistent, recognizable voices across various projects and platforms.
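To make the idea of "setting the stage" more concrete, here is a minimal Python sketch of how a scene description, per-character direction, and dialogue turns might be combined into a single text input. Neither source documents the actual prompt or audio-tag syntax, so everything here is an assumption for illustration: the `SpeakerProfile` class, the `build_directed_script` helper, and the text layout are hypothetical, and no Gemini API call is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakerProfile:
    """Hypothetical per-character direction; field names are illustrative."""
    name: str
    style: str   # e.g. "enthusiastic", "informative"
    accent: str  # e.g. "RP", "Southern", "Transatlantic"
    pace: str    # free-text pace hint, e.g. "brisk"


def build_directed_script(scene, speakers, turns):
    """Compose one plain-text prompt: scene, speaker notes, then dialogue.

    The layout is an assumption for illustration, not a documented format.
    """
    known = {s.name for s in speakers}
    lines = [f"Scene: {scene}", "", "Speakers:"]
    for s in speakers:
        lines.append(f"- {s.name}: {s.style} tone, {s.accent} accent, {s.pace} pace")
    lines.append("")
    for name, text in turns:
        # Every speaking character needs a profile so its voice stays consistent.
        if name not in known:
            raise ValueError(f"no profile defined for speaker {name!r}")
        lines.append(f"{name}: {text}")
    return "\n".join(lines)


prompt = build_directed_script(
    scene="A two-host technology podcast recorded in a small studio.",
    speakers=[
        SpeakerProfile("Ava", "enthusiastic", "Transatlantic", "brisk"),
        SpeakerProfile("Ben", "informative", "RP", "measured"),
    ],
    turns=[
        ("Ava", "Welcome back to the show!"),
        ("Ben", "Today we are looking at text-to-speech models."),
    ],
)
print(prompt)
```

Keeping the scene and speaker notes in one reusable structure mirrors the article's point about consistency: the same profiles can be paired with new dialogue turns without redefining each character.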
All audio generated by Flash TTS includes a SynthID watermark embedded in the output, making AI-generated content easy to detect [1]. This invisible watermark helps address concerns about transparency in AI-generated audio [2].

Developers can access the model in preview through the Gemini API and Google AI Studio, while enterprise users can utilize it via Vertex AI [2]. Workspace users can also access Flash TTS through Google Vids. The model's improved controllability, expressivity, and quality position it as a competitive option for developers seeking advanced text-to-speech capabilities with fine-grained control and broad language support.

Summarized by Navi