A broader push into programmable voice
Google is widening its generative audio offering with the release of Gemini 3.1 Flash text-to-speech, a new model the company describes as its most natural and expressive speech system so far. The update, reported by The Decoder, focuses on controllability as much as raw voice quality, giving developers more direct ways to shape how generated speech sounds.
The headline feature is a system of audio tags: text commands that let users steer style, tempo, tone, and accent. That matters because one of the long-running problems in text-to-speech is not simply making audio sound realistic, but making it reliably expressive in ways that map to product needs. Assistants, narrated explainers, customer-service flows, educational content, and dialog-heavy applications all benefit from different pacing and vocal styles.
By exposing those controls as simple text instructions, Google appears to be lowering the friction between prompt design and voice output. Instead of treating tone and delivery as opaque model behavior, the platform is presenting them as parameters developers can intentionally influence.
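To make the idea concrete, here is a minimal sketch of what composing a prompt with inline style controls could look like. The tag names and bracket syntax below are illustrative assumptions, not the documented Gemini tag format, which the report does not spell out.

```python
# Hypothetical sketch: prefixing text to be spoken with style-control
# tags. The [name: value] syntax is an assumption for illustration;
# the actual audio-tag format is not specified in the article.

def build_tts_prompt(text: str, **tags: str) -> str:
    """Prefix the text to be spoken with style-control tags."""
    tag_str = " ".join(f"[{name}: {value}]" for name, value in tags.items())
    return f"{tag_str} {text}".strip()

prompt = build_tts_prompt(
    "Welcome back! Let's pick up where we left off.",
    tone="warm",
    pace="slow",
    accent="British English",
)
print(prompt)
# [tone: warm] [pace: slow] [accent: British English] Welcome back! Let's pick up where we left off.
```

The point of the pattern is the one the article makes: delivery instructions live in the same text channel as the content, so developers can version, template, and A/B-test them like any other prompt.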
Language breadth and multi-speaker support
According to the report, Gemini 3.1 Flash TTS supports more than 70 languages and can generate multi-speaker dialogs. Those two capabilities make the model relevant not only for English-language demos but also for global products and more complex media workflows.
Language coverage is increasingly a competitive differentiator in AI voice. Many applications need one model family that can serve multiple markets without forcing teams to assemble a patchwork of region-specific providers. Multi-speaker dialog support is similarly useful because it opens the door to richer formats such as conversational lessons, dramatized narration, and synthetic host exchanges for short-form media.
The combination suggests Google is aiming at both developer tooling and enterprise deployment rather than a narrow consumer demo strategy. Availability through the Gemini API, Vertex AI for enterprise users, Google Vids for Workspace users, and AI Studio for free experimentation reinforces that point. The product is being positioned across prototyping and production channels at the same time.
Pricing and data-use split between free and paid tiers
The model’s economics are also explicit. The Decoder reports a free tier, with the caveat that Google uses free-tier data to improve its products. The paid tier is priced at $1.00 per million tokens for text input and $20.00 per million tokens for audio output, while batch mode cuts those costs in half to $0.50 and $10.00 respectively. On the paid tier, Google does not use the data for product improvement.
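The reported rates are simple enough to turn into a back-of-the-envelope cost estimator. The function below uses exactly the figures above; note that the report does not say how audio duration maps to output tokens, so real totals depend on that conversion.

```python
# Cost estimate using the rates reported for Gemini 3.1 Flash TTS:
# $1.00 per 1M text input tokens, $20.00 per 1M audio output tokens,
# with both rates halved in batch mode.

TEXT_IN_PER_M = 1.00    # USD per million text input tokens
AUDIO_OUT_PER_M = 20.00  # USD per million audio output tokens

def tts_cost(text_tokens: int, audio_tokens: int, batch: bool = False) -> float:
    """Return the estimated USD cost for one workload."""
    rate_scale = 0.5 if batch else 1.0
    cost = (text_tokens / 1e6) * TEXT_IN_PER_M \
         + (audio_tokens / 1e6) * AUDIO_OUT_PER_M
    return round(cost * rate_scale, 4)

# Example: 200k text tokens in, 2M audio tokens out.
print(tts_cost(200_000, 2_000_000))              # 40.2
print(tts_cost(200_000, 2_000_000, batch=True))  # 20.1
```

As the example shows, audio output dominates the bill at these rates, which is why the 50% batch discount matters most for high-volume narration workloads.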
That split is significant because it mirrors a broader pattern across AI infrastructure: low-friction testing for experimentation, and clearer data-treatment boundaries for commercial use. For many developers, especially those working on customer-facing or regulated products, data-use terms can matter as much as benchmark performance.
The pricing model also suggests Google is competing on value as well as capability. Text-to-speech is now crowded with specialized voice startups and large cloud incumbents, so cost-performance balance can be decisive for adoption.
How it is being benchmarked
The report cites Artificial Analysis, where Gemini 3.1 Flash TTS is said to hold an Elo rating of 1,211. It also says the model outperforms ElevenLabs v3 in overall quality and trails only Inworld 1.5 Max. Whether or not those standings hold over time, the inclusion of benchmark context matters because the voice market has matured beyond novelty. Buyers increasingly expect measurable comparisons on quality, latency, controllability, and price.
Google’s emphasis on quality-to-price ratio appears designed to answer that market. A model that is near the top of the rankings while remaining aggressively priced becomes easier to justify for large-scale deployments, especially where audio output volumes are high.
Watermarking as part of the release
Every generated audio file from the model is tagged with Google’s SynthID watermark, according to the report. That is an important implementation detail in a period when synthetic media governance is becoming a practical product issue rather than an abstract ethics discussion.
Watermarking does not eliminate misuse concerns, but it does show that provenance is being built into the release architecture. For enterprise customers and platform operators, that can be a meaningful signal that Google expects voice generation to scale into environments where authenticity and disclosure will matter.
A more competitive AI voice stack
The broader significance of this release is that it strengthens Google’s position in multimodal AI by making voice output more programmable, more multilingual, and more accessible across its product ecosystem. Text generation alone is no longer enough for many applications. Teams increasingly want text, image, video, and audio capabilities that can be orchestrated together.
Gemini 3.1 Flash TTS looks designed for that environment. The model’s expressive controls, broad language support, multi-speaker capability, preview availability, and pricing structure all point toward a practical deployment story rather than a research-only announcement.
Whether it becomes the default choice for developers will depend on real-world testing, but the release makes one thing clear: the race in generative AI voice is no longer just about sounding human. It is about precision, integration, economics, and trust, all arriving in one package.
This article is based on reporting by The Decoder.
Originally published on the-decoder.com