xAI adds Smart Turn end-of-turn detection to the streaming Speech-to-Text API
xAI has added Smart Turn end-of-turn detection to its streaming Speech-to-Text API, a feature aimed at one of the most stubborn problems in voice interfaces — knowing when a speaker has actually finished speaking versus pausing briefly mid-thought. The change appears in the xAI release notes under the current May listing and is enabled per-request via a query parameter.
What's new
The xAI release notes state: "The streaming Speech to Text API now supports Smart Turn end-of-turn detection. When enabled via the smart_turn query parameter, an ML model predicts whether the speaker has finished their thought at silence boundaries — reducing false endpointing during dictation, number sequences, and mid-sentence pauses."
Key specifics from the release note:
- Activated via a smart_turn query parameter on the streaming STT endpoint; it is opt-in rather than a default behavior change.
- An ML model (not a fixed silence-duration heuristic) decides whether the turn has ended.
- The decision runs at silence boundaries, the same trigger point where traditional voice activity detection would otherwise cut off a turn.
- The release note explicitly calls out three failure modes the model is designed to handle: dictation, number sequences, and mid-sentence pauses.
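As a rough illustration of the opt-in mechanism, the sketch below appends a smart_turn query parameter to a streaming URL. The release note names only the parameter. The endpoint URL, the "true"/"false" value format, and the helper name with_smart_turn are assumptions made for this example, not documented xAI API details.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical placeholder: the release note does not give the streaming STT URL.
BASE_URL = "wss://api.example.invalid/v1/stt/stream"


def with_smart_turn(url: str, enabled: bool = True) -> str:
    """Return `url` with the smart_turn query parameter set, keeping other params."""
    parts = urlsplit(url)
    # Drop any existing smart_turn value so the parameter appears exactly once.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "smart_turn"
    ]
    # Assumption: the parameter accepts a boolean-like string value.
    query.append(("smart_turn", "true" if enabled else "false"))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_smart_turn(BASE_URL))
```

Because the flag travels with each connection URL, an application could turn it on for one session type (say, dictation) and leave the rest on their existing endpointing behavior.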
Context
End-of-turn detection is the connective tissue of any real-time voice agent. Traditional voice activity detection (VAD) systems use audio-energy thresholds and a silence timer to decide when a user has stopped talking. They are brittle in exactly the scenarios xAI calls out: a person dictating a long passage pauses to think, a customer reads out a sixteen-digit card number with natural breaks, or a caller hedges mid-sentence. In each of those cases, naive VAD declares the turn over and the agent starts responding before the user is done — a failure that compounds quickly across a multi-turn call.
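The brittleness of the threshold-plus-timer approach is easy to see in a toy version. The following is a minimal sketch of a generic energy-threshold VAD with a silence timer, not xAI's implementation or that of any specific library. The function name, the frame energies, and the threshold values are all invented for illustration.

```python
from typing import List, Optional


def naive_end_of_turn(
    frame_energies: List[float],
    energy_threshold: float,
    max_silent_frames: int,
) -> Optional[int]:
    """Return the frame index where a silence-timer VAD ends the turn, or None."""
    silent_run = 0
    heard_speech = False
    for index, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            # Speech frame: reset the silence timer.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Silence after speech: advance the timer and check the cutoff.
            silent_run += 1
            if silent_run >= max_silent_frames:
                return index
    return None


# Made-up energies: speech, a three-frame thinking pause, then more speech.
frames = [0.9, 0.8, 0.1, 0.1, 0.1, 0.7, 0.9]
print(naive_end_of_turn(frames, energy_threshold=0.5, max_silent_frames=3))
```

With these values the timer fires at frame 4, before the speaker resumes at frame 5. Raising max_silent_frames to 4 avoids the cutoff here, but at the cost of a slower response after every genuine end of turn, which is the trade-off a model-based decision at the silence boundary is meant to ease.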
xAI launched its Grok Voice Agent API to general availability in December 2025, which made turn detection an immediate weak point relative to dedicated voice infrastructure providers that have shipped ML-based endpointing for over a year. Today's addition closes that specific gap inside xAI's own STT stack rather than requiring developers to layer a third-party model on top of Grok voice workflows. The Smart Turn note sits alongside two earlier xAI moves in late May — a websocket mode for the Grok Responses API and a context-compaction API — that together point to a sustained push on real-time and long-session developer ergonomics rather than purely on model intelligence.
The xAI release notes index the entry under "May" without a specific day; the change was first detected by the ModelDex news scan on 2026-06-03.
Why it matters
For developers building voice agents on xAI, this is one fewer dependency. Smart Turn-style detection has historically been a justification for choosing a specialized voice platform over a generalist model API; pulling it inside the xAI surface area means a leaner stack and tighter latency budget, since the endpointing decision no longer requires a round-trip through a separate inference path.
For the broader voice-agent ecosystem, xAI is now the third major foundation-model provider to ship native ML-based turn detection inside its own STT pipeline, alongside earlier work from OpenAI's realtime API and Google's voice tooling. The center of gravity for voice infrastructure is shifting away from middleware providers and into the model vendors themselves — the same compression pattern that earlier consumed standalone vector databases and dedicated transcription companies.
The opt-in design — a query parameter rather than a default — is a sensible choice. Switching endpointing strategy mid-product is the kind of change that breaks carefully calibrated voice UX, and forcing existing customers onto a new behavior would be disruptive. Open questions at launch: latency overhead of the ML model relative to legacy VAD, accuracy on non-English speech, and whether smart_turn graduates to default behavior over time as xAI gains confidence in its calibration.
Corroborating sources
- xAI release notes (docs.x.ai), https://docs.x.ai/docs/release-notes: "The streaming Speech to Text API now supports Smart Turn end-of-turn detection."