Google adds real-time streaming to Gemini API speech generation with gemini-3.1-flash-tts-preview
Google updated the Gemini API on June 17, 2026 to support streaming for text-to-speech output via the gemini-3.1-flash-tts-preview model. Developers can now receive audio chunks as they are generated rather than waiting for the full response, which matters most for voice agents and real-time speech applications.
What's new
The Gemini API changelog entry for June 17 states that streaming is now available for speech generation:
- REST API: use `streamGenerateContent` to stream audio output
- Interactions API: pass `stream: true` to receive audio incrementally
- Supported model: `gemini-3.1-flash-tts-preview`
- Full implementation details in the updated Text-to-Speech guide
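As a rough illustration of the REST path, the sketch below builds a `streamGenerateContent` request and decodes audio from the streamed events. Only the method name and model ID come from the changelog. The base URL, the `alt=sse` query parameter, the request body fields, the `Kore` voice name, and the `inlineData` response shape are assumptions carried over from how other Gemini API speech calls look, so check them against the Text-to-Speech guide before relying on them.

```python
import base64
import json
from typing import Iterable, Iterator

MODEL = "gemini-3.1-flash-tts-preview"
# Assumption: same base URL pattern as other Gemini API REST calls.
BASE_URL = "https://generativelanguage.googleapis.com/v1beta"


def build_stream_request(text: str, voice: str = "Kore") -> tuple[str, dict]:
    """Return (url, json_body) for a streaming TTS call.

    The body layout and the default voice name are assumptions, not taken
    from the changelog entry.
    """
    url = f"{BASE_URL}/models/{MODEL}:streamGenerateContent?alt=sse"
    body = {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}
            },
        },
    }
    return url, body


def iter_audio_chunks(sse_lines: Iterable[str]) -> Iterator[bytes]:
    """Yield decoded audio bytes from server-sent-event lines, one chunk at a time."""
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        event = json.loads(line[len("data:"):].strip())
        # Assumed response shape: candidates -> content -> parts -> inlineData.data
        for candidate in event.get("candidates", []):
            for part in candidate.get("content", {}).get("parts", []):
                data = part.get("inlineData", {}).get("data")
                if data:
                    yield base64.b64decode(data)
```

With an HTTP client such as `requests`, the lines could come from `requests.post(url, json=body, headers={"x-goog-api-key": api_key}, stream=True).iter_lines(decode_unicode=True)`, where `api_key` is a placeholder for your own key. Keeping the parser separate from the network call makes it easy to test against recorded events.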
Before this update, the speech generation endpoint only returned audio once the complete response was ready, requiring developers to buffer the entire result before playback could begin.
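To make the contrast with full-response buffering concrete, here is a minimal sketch of a consumer that hands each chunk to its destination as soon as it arrives. It writes to a WAV stream for simplicity; a real voice agent would more likely feed an audio output device. The audio format (16-bit mono PCM at 24 kHz) is an assumption that the changelog entry does not confirm, and `write_chunks_to_wav` is a hypothetical helper name.

```python
import io
import wave
from typing import BinaryIO, Iterable

# Assumed output format; verify against the Text-to-Speech guide.
SAMPLE_RATE_HZ = 24_000
SAMPLE_WIDTH_BYTES = 2  # 16-bit samples
CHANNELS = 1


def write_chunks_to_wav(chunks: Iterable[bytes], out: BinaryIO) -> int:
    """Write each PCM chunk to a WAV stream as it arrives; return frames written."""
    frames = 0
    with wave.open(out, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        for chunk in chunks:
            # Each chunk is written immediately; nothing waits for the full response.
            wav.writeframes(chunk)
            frames += len(chunk) // (SAMPLE_WIDTH_BYTES * CHANNELS)
    return frames
```

Because `chunks` is any iterable, the function works the same way with a live stream or with a list of recorded chunks, for example `write_chunks_to_wav(recorded_chunks, io.BytesIO())`.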
Context
Google has been expanding Gemini's multimodal output capabilities throughout 2026. The gemini-3.1-flash-tts-preview model is part of a family of speech-capable models that Google has made available through the Gemini API. Its "flash" designation suggests optimization for speed and cost over the highest-fidelity options.
This update arrives alongside a wave of deprecations on the image and video side. On June 15, Google announced the retirement of the Imagen 4 generation models (imagen-4.0-generate-001 and its ultra and fast variants) effective August 17, 2026, and the shutdown of the Veo 2 and Veo 3 generation models on June 30, 2026. The retired Veo models are replaced by veo-3.1-generate-preview and veo-3.1-fast-generate-preview, plus GA models available through the Gemini Enterprise Agent Platform.
The pattern suggests Google is actively pruning older model snapshots while extending capabilities on models it's committing to long-term.
Why it matters
For voice assistant and voice agent developers, latency is the critical user experience metric. When a user asks a voice AI a question, they expect audio to begin almost immediately, not after a multi-second wait while the full generation buffers. Streaming TTS collapses that wait: audio playback can begin as soon as the first chunks are ready, with the rest following in real time.
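The effect on time-to-first-audio can be shown with simple arithmetic. The sketch below is a toy model, not a measurement: `time_to_first_audio` is a hypothetical helper, the per-chunk generation times are invented for illustration, and network and decoding overhead are ignored.

```python
def time_to_first_audio(chunk_gen_seconds: list[float], streaming: bool) -> float:
    """Seconds until playback can start, ignoring network and decode overhead."""
    if not chunk_gen_seconds:
        return 0.0
    if streaming:
        # Playback can start once the first chunk has been generated.
        return chunk_gen_seconds[0]
    # Without streaming, every chunk must be generated before any audio plays.
    return sum(chunk_gen_seconds)


# Hypothetical generation times for four chunks, in seconds.
chunk_times = [0.25, 0.5, 0.5, 0.75]
print(time_to_first_audio(chunk_times, streaming=True))   # 0.25
print(time_to_first_audio(chunk_times, streaming=False))  # 2.0
```

In this model the buffered wait grows with the length of the response, while the streamed wait depends only on the first chunk.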
Streaming support for gemini-3.1-flash-tts-preview gives the Gemini API the incremental audio delivery that developers expect from production-grade voice pipelines. Without it, developers had to either accept the added latency or engineer workarounds. For teams building voice-first agents, customer service bots, or real-time language learning tools on the Gemini stack, this is a direct quality-of-life upgrade.
Corroborating sources
- Changelog
https://ai.google.dev/gemini-api/docs/changelog
“Streaming via `streamGenerateContent` (and `stream: true` in the Interactions API) is now supported for the `gemini-3.1-flash-tts-preview` model.”