A broader push into programmable voice
Google is widening its generative audio offering with the release of Gemini 3.1 Flash text-to-speech, a new model the company describes as its most natural and expressive speech system so far. The update, reported by The Decoder, focuses on controllability as much as raw voice quality, giving developers more direct ways to shape how generated speech sounds.
The headline feature is a system of audio tags: text commands that let users steer style, tempo, tone, and accent. That matters because one of the long-running problems in text-to-speech is not simply making audio sound realistic, but making it reliably expressive in ways that map to product needs. Assistants, narrated explainers, customer-service flows, educational content, and dialog-heavy applications all benefit from different pacing and vocal styles.
By exposing those controls as simple text instructions, Google appears to be lowering the friction between prompt design and voice output. Instead of treating tone and delivery as opaque model behavior, the platform is presenting them as parameters developers can intentionally influence.
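To make the idea concrete, here is a minimal sketch of what composing a prompt with inline style controls could look like. The tag names and bracket syntax below are illustrative assumptions, not the documented Gemini tag format, which the report does not spell out.

```python
# Hypothetical sketch: prefixing text to be spoken with style-control
# tags. The [name: value] syntax is an assumption for illustration;
# the actual audio-tag format is not specified in the article.

def build_tts_prompt(text: str, **tags: str) -> str:
    """Prefix the text to be spoken with style-control tags."""
    tag_str = " ".join(f"[{name}: {value}]" for name, value in tags.items())
    return f"{tag_str} {text}".strip()

prompt = build_tts_prompt(
    "Welcome back! Let's pick up where we left off.",
    tone="warm",
    pace="slow",
    accent="British English",
)
print(prompt)
# [tone: warm] [pace: slow] [accent: British English] Welcome back! Let's pick up where we left off.
```

The point of the pattern is the one the article makes: delivery instructions live in the same text channel as the content, so developers can version, template, and A/B-test them like any other prompt.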
Language breadth and multi-speaker support
According to the report, Gemini 3.1 Flash TTS supports more than 70 languages and can generate multi-speaker dialogs. Those two capabilities make the model relevant not only for English-language demos but also for global products and more complex media workflows.
Language coverage is increasingly a competitive differentiator in AI voice. Many applications need one model family that can serve multiple markets without forcing teams to assemble a patchwork of region-specific providers. Multi-speaker dialog support is similarly useful because it opens the door to richer formats such as conversational lessons, dramatized narration, and synthetic host exchanges for short-form media.
The combination suggests Google is aiming at both developer tooling and enterprise deployment rather than a narrow consumer demo strategy. Availability through the Gemini API, Vertex AI for enterprise users, Google Vids for Workspace users, and AI Studio for free experimentation reinforces that point. The product is being positioned across prototyping and production channels at the same time.
Pricing and data-use split between free and paid tiers
The model’s economics are also explicit. The Decoder reports a free tier, with the caveat that Google uses free-tier data to improve its products. The paid tier is priced at $1.00 per million tokens for text input and $20.00 per million tokens for audio output, while batch mode cuts those costs in half to $0.50 and $10.00 respectively. On the paid tier, Google does not use the data for product improvement.
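The reported rates are simple enough to turn into a back-of-the-envelope cost estimator. The function below uses exactly the figures above; note that the report does not say how audio duration maps to output tokens, so real totals depend on that conversion.

```python
# Cost estimate using the rates reported for Gemini 3.1 Flash TTS:
# $1.00 per 1M text input tokens, $20.00 per 1M audio output tokens,
# with both rates halved in batch mode.

TEXT_IN_PER_M = 1.00    # USD per million text input tokens
AUDIO_OUT_PER_M = 20.00  # USD per million audio output tokens

def tts_cost(text_tokens: int, audio_tokens: int, batch: bool = False) -> float:
    """Return the estimated USD cost for one workload."""
    rate_scale = 0.5 if batch else 1.0
    cost = (text_tokens / 1e6) * TEXT_IN_PER_M \
         + (audio_tokens / 1e6) * AUDIO_OUT_PER_M
    return round(cost * rate_scale, 4)

# Example: 200k text tokens in, 2M audio tokens out.
print(tts_cost(200_000, 2_000_000))              # 40.2
print(tts_cost(200_000, 2_000_000, batch=True))  # 20.1
```

As the example shows, audio output dominates the bill at these rates, which is why the 50% batch discount matters most for high-volume narration workloads.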
That split is significant because it mirrors a broader pattern across AI infrastructure: low-friction testing for experimentation, and clearer data-treatment boundaries for commercial use. For many developers, especially those working on customer-facing or regulated products, data-use terms can matter as much as benchmark performance.
The pricing model also suggests Google is competing on value as well as capability. Text-to-speech is now crowded with specialized voice startups and large cloud incumbents, so cost-performance balance can be decisive for adoption.
How it is being benchmarked
The report cites Artificial Analysis, where Gemini 3.1 Flash TTS is said to hold an Elo rating of 1,211. It also says the model outperforms ElevenLabs v3 in overall quality and trails only Inworld 1.5 Max. Whether or not those standings hold over time, the inclusion of benchmark context matters because the voice market has matured beyond novelty. Buyers increasingly expect measurable comparisons on quality, latency, controllability, and price.
Google’s emphasis on quality-to-price ratio appears designed to answer that market. A model that is near the top of the rankings while remaining aggressively priced becomes easier to justify for large-scale deployments, especially where audio output volumes are high.
Watermarking as part of the release
Every generated audio file from the model is tagged with Google’s SynthID watermark, according to the report. That is an important implementation detail in a period when synthetic media governance is becoming a practical product issue rather than an abstract ethics discussion.
Watermarking does not eliminate misuse concerns, but it does show that provenance is being built into the release architecture. For enterprise customers and platform operators, that can be a meaningful signal that Google expects voice generation to scale into environments where authenticity and disclosure will matter.
A more competitive AI voice stack
The broader significance of this release is that it strengthens Google’s position in multimodal AI by making voice output more programmable, more multilingual, and more accessible across its product ecosystem. Text generation alone is no longer enough for many applications. Teams increasingly want text, image, video, and audio capabilities that can be orchestrated together.
Gemini 3.1 Flash TTS looks designed for that environment. The model’s expressive controls, broad language support, multi-speaker capability, preview availability, and pricing structure all point toward a practical deployment story rather than a research-only announcement.
Whether it becomes the default choice for developers will depend on real-world testing, but the release makes one thing clear: the race in generative AI voice is no longer just about sounding human. It is about precision, integration, economics, and trust, all arriving in one package.
This article is based on reporting by The Decoder.
Originally published on the-decoder.com