Google adds real-time streaming to Gemini API speech generation with gemini-3.1-flash-tts-preview

Published Jun 17, 2026

Google updated the Gemini API on June 17, 2026 to support streaming for text-to-speech output via the gemini-3.1-flash-tts-preview model. Developers can now receive audio chunks as they are generated rather than waiting for the full response—a meaningful change for voice agents and real-time speech applications.

What's new

The Gemini API changelog entry for June 17 states that streaming is now available for speech generation:

REST API: use streamGenerateContent to stream audio output
Interactions API: pass stream: true to receive audio incrementally
Supported model: gemini-3.1-flash-tts-preview
Full implementation details in the updated Text-to-Speech guide

Before this update, the speech generation endpoint only returned audio once the complete response was ready, requiring developers to buffer the entire result before playback could begin.
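As a rough illustration of the REST path described above, the following Python sketch posts a text prompt to streamGenerateContent and yields audio bytes as each chunk arrives. It is a minimal sketch under stated assumptions, not the implementation from the Text-to-Speech guide: the request body shape (responseModalities, speechConfig), the "Kore" voice name, the ?alt=sse query parameter, and the inlineData response field are carried over from the Gemini API's existing non-streaming speech generation and are assumed to apply to this model as well. The helper names are hypothetical.

```python
import base64
import json
import urllib.request
from typing import Iterable, Iterator

# Assumed endpoint root; the model name comes from the changelog entry.
API_ROOT = "https://generativelanguage.googleapis.com/v1beta/models"
MODEL = "gemini-3.1-flash-tts-preview"


def build_request(text: str, api_key: str, voice: str = "Kore") -> urllib.request.Request:
    """Build a streaming TTS request (body shape is an assumption)."""
    body = {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}
            },
        },
    }
    url = f"{API_ROOT}/{MODEL}:streamGenerateContent?alt=sse"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )


def iter_audio_chunks(sse_lines: Iterable[str]) -> Iterator[bytes]:
    """Decode audio bytes from server-sent-event lines, one chunk at a time."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        event = json.loads(payload)
        for candidate in event.get("candidates", []):
            for part in candidate.get("content", {}).get("parts", []):
                inline = part.get("inlineData")
                if inline and "data" in inline:
                    yield base64.b64decode(inline["data"])


def stream_speech(text: str, api_key: str) -> Iterator[bytes]:
    """Yield audio chunks as they arrive instead of buffering the full reply."""
    with urllib.request.urlopen(build_request(text, api_key)) as resp:
        yield from iter_audio_chunks(raw.decode("utf-8") for raw in resp)
```

A caller would loop over stream_speech("Hello there", api_key) and hand each chunk to an audio player or socket as soon as it is yielded. The audio encoding of those bytes is not stated in the changelog, so consult the Text-to-Speech guide before choosing a decoder.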

Context

Google has been expanding Gemini's multimodal output capabilities throughout 2026. The gemini-3.1-flash-tts-preview model is part of a family of speech-capable models that Google has made available through the Gemini API. Its "flash" designation suggests optimization for speed and cost over the highest-fidelity options.

This update arrives alongside a wave of deprecations on the image and video side. On June 15, Google announced the retirement of the Imagen 4 generation models (imagen-4.0-generate-001 and its ultra and fast variants) effective August 17, 2026, and the shutdown of the Veo 2 and Veo 3 generation models on June 30, 2026. The retired Veo models are succeeded by veo-3.1-generate-preview and veo-3.1-fast-generate-preview, plus GA models available through the Gemini Enterprise Agent Platform.

The pattern suggests Google is actively pruning older model snapshots while extending capabilities on models it's committing to long-term.

Why it matters

For voice assistant and voice agent developers, latency is the critical user experience metric. When a user asks a voice AI a question, they expect audio to begin almost immediately—not after a multi-second wait for the full generation to buffer. Streaming TTS collapses that wait: audio playback can begin as soon as the first chunks are ready, with the rest following in real time.
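The latency argument can be made concrete with a small Python sketch. Given the times at which audio chunks arrive, buffered playback cannot start until the last chunk lands, while streamed playback can start at the first. The function name and the arrival times below are hypothetical and chosen only for illustration; they are not measurements of the Gemini API.

```python
from typing import Dict, Sequence


def playback_start_delays(arrival_times_s: Sequence[float]) -> Dict[str, float]:
    """Compare when playback can begin with and without streaming.

    arrival_times_s holds the time, in seconds after the request was sent,
    at which each audio chunk arrived.
    """
    if not arrival_times_s:
        raise ValueError("at least one chunk arrival time is required")
    return {
        # Buffered: wait for every chunk before playing anything.
        "buffered": max(arrival_times_s),
        # Streamed: start playing once the first chunk is available.
        "streamed": min(arrival_times_s),
    }


# Hypothetical arrival times for four chunks of one response.
delays = playback_start_delays([0.3, 0.6, 0.9, 1.2])
print(delays)
```

With these illustrative numbers, streaming would let playback begin at 0.3 seconds instead of 1.2 seconds. The sketch ignores playback underruns, which can occur if later chunks arrive more slowly than the audio is played.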

The gemini-3.1-flash-tts-preview streaming addition brings the Gemini API's speech capabilities to parity with what developers expect from production-grade voice pipelines. It removes a friction point that would have forced developers to either accept latency or engineer workarounds. For teams building voice-first agents, customer service bots, or real-time language learning tools on the Gemini stack, this is a direct quality-of-life upgrade.

Corroborating sources

Changelog
https://ai.google.dev/gemini-api/docs/changelog
“Streaming via `streamGenerateContent` (and `stream: true` in the Interactions API) is now supported for the `gemini-3.1-flash-tts-preview` model.”
