xAI ships Context Compaction API, WebSocket Responses mode, and Smart Turn STT detection in a single May 29 update
xAI shipped three API features on May 29, 2026 targeting production agentic workloads: a Context Compaction API for compressing long conversations into reusable shorter contexts, a WebSocket Responses API mode for lower-latency tool-heavy agents, and Smart Turn end-of-turn detection in the Streaming Speech-to-Text API. All three address common pain points in running agents at scale.
What's new
Context Compaction API
- Now generally available on the xAI API
- Compresses a long conversation into a shorter context snapshot that can be reused in follow-up requests
- Benefits: lower cost per request, faster time-to-first-token, sharper responses in long agent loops where context would otherwise grow unbounded
- Designed for agent workloads where session history accumulates over many turns
WebSocket Responses API Mode
- Drives the Responses API over a single, persistent WebSocket connection instead of per-request HTTP calls
- Reduces end-to-end latency for tool-heavy agents that make many rapid tool calls per session
- Eliminates HTTP connection setup overhead on each tool invocation
- Well-suited for agents with tight latency budgets and high tool-call frequency
Smart Turn (Streaming STT)
- Available via the
smart_turnquery parameter in the streaming Speech-to-Text API - Uses a machine learning model to predict whether a speaker has finished their turn at silence boundaries — rather than cutting off after a fixed silence timeout
- Reduces false endpointing during dictation, number sequences, and mid-sentence pauses
smart_turn_timeoutparameter sets a maximum silence fallback for cases where the ML model is uncertain
Context
These three features arrived together but address different layers of the agentic stack. Context Compaction is memory management: long-running agents accumulate state, and compressing that state changes the economics of multi-turn sessions. The problem is not new — it is the same challenge that Anthropic addresses with expanded context windows and prompt caching, approached through compression rather than raw capacity.
WebSocket Responses Mode is infrastructure optimization. HTTP request overhead is measurable in high-frequency tool-calling agents where the model is orchestrating dozens of tool calls per user interaction. A persistent WebSocket connection eliminates that overhead.
Smart Turn is a voice-specific quality-of-life feature. Current STT systems use simple silence-duration heuristics to detect end-of-turn, which fail on common speech patterns — pauses mid-sentence, number reading, thinking aloud. An ML-based detector that understands linguistic completion is a meaningful improvement for voice agents.
Why it matters
Context Compaction is the standout feature economically. Production agent workloads accumulate context fast: a customer service agent running for 30 minutes may accumulate tens of thousands of tokens. Compressing that into a reusable snapshot before making the next API call can cut input token costs significantly — especially for agents that re-anchor their context frequently.
WebSocket Responses Mode is a compounding advantage for xAI on latency-sensitive workloads. In agentic coding or customer-facing agents where tool calls are tightly coupled to user-visible actions, shaving connection overhead off each call adds up.
Smart Turn is narrow but meaningful for voice agent builders specifically. The failure mode it fixes — cutting off a speaker mid-sentence — is one of the most common complaints in deployed voice agent products. An ML-based detection layer that understands speech patterns rather than just silence duration is a differentiated capability in the market.
Corroborating sources
- Docs.x
https://docs.x.ai/docs/release-notes
“The Context Compaction API is now available. You can shrink long conversations into a shorter context and reuse it in follow-up requests for lower cost, faster time-to-first-token, and sharper responses on long agent loops.”