
xAI ships Context Compaction API, WebSocket Responses mode, and Smart Turn STT detection in a single May 29 update

May 29, 2026 (published Jun 5, 2026)

xAI shipped three API features on May 29, 2026 targeting production agentic workloads: a Context Compaction API for compressing long conversations into reusable shorter contexts, a WebSocket Responses API mode for lower-latency tool-heavy agents, and Smart Turn end-of-turn detection in the Streaming Speech-to-Text API. All three address common pain points in running agents at scale.

What's new

Context Compaction API

Now generally available on the xAI API
Compresses a long conversation into a shorter context snapshot that can be reused in follow-up requests
Benefits: lower cost per request, faster time-to-first-token, sharper responses in long agent loops where context would otherwise grow unbounded
Designed for agent workloads where session history accumulates over many turns
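
The idea behind compaction can be sketched in a few lines of Python. This is a conceptual illustration only, not the xAI API: the `Message` type, the `compact_history` function, and the `toy_summarizer` stand-in are all hypothetical names, and the real service presumably produces its snapshot with a model rather than a placeholder string.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str
    content: str


def compact_history(
    history: List[Message],
    summarize: Callable[[List[Message]], str],
    keep_recent: int = 2,
) -> List[Message]:
    """Fold all but the last `keep_recent` messages into one snapshot message."""
    split = len(history) - keep_recent
    if split <= 0:
        # Nothing old enough to fold away; return a copy unchanged.
        return list(history)
    older, recent = history[:split], history[split:]
    snapshot = Message(role="system", content=summarize(older))
    return [snapshot] + recent


def toy_summarizer(messages: List[Message]) -> str:
    # Hypothetical stand-in for the real compaction step.
    return f"[summary of {len(messages)} earlier messages]"


history = [Message("user", f"turn {i}") for i in range(10)]
compacted = compact_history(history, toy_summarizer, keep_recent=2)
print(len(history), "->", len(compacted))
```

The follow-up request would then carry the shorter `compacted` list instead of the full history, which is where the cost and time-to-first-token benefits come from.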

WebSocket Responses API Mode

Drives the Responses API over a single, persistent WebSocket connection instead of per-request HTTP calls
Reduces end-to-end latency for tool-heavy agents that make many rapid tool calls per session
Eliminates HTTP connection setup overhead on each tool invocation
Well-suited for agents with tight latency budgets and high tool-call frequency

Smart Turn (Streaming STT)

Available via the smart_turn query parameter in the streaming Speech-to-Text API
Uses a machine learning model to predict, at silence boundaries, whether a speaker has finished their turn, rather than cutting off after a fixed silence timeout
Reduces false endpointing during dictation, number sequences, and mid-sentence pauses
smart_turn_timeout parameter sets a maximum silence fallback for cases where the ML model is uncertain
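
As a rough sketch of how the two query parameters might be attached to a streaming request, the snippet below builds a URL with Python's standard library. Only the parameter names `smart_turn` and `smart_turn_timeout` come from the release notes. The endpoint URL, the `"true"` value format, and the timeout unit are assumptions made for illustration, so check the xAI documentation before relying on any of them.

```python
from urllib.parse import urlencode

# Hypothetical placeholder endpoint, not taken from the xAI docs.
BASE_URL = "wss://example.invalid/v1/stt/stream"


def build_stt_url(base_url, smart_turn=True, smart_turn_timeout=None):
    """Append Smart Turn query parameters to a streaming STT URL."""
    params = {}
    if smart_turn:
        # Assumed value format; the real API may expect something else.
        params["smart_turn"] = "true"
    if smart_turn_timeout is not None:
        # Assumed to be a plain number; the unit is not stated in the source.
        params["smart_turn_timeout"] = str(smart_turn_timeout)
    query = urlencode(params)
    return f"{base_url}?{query}" if query else base_url


url = build_stt_url(BASE_URL, smart_turn=True, smart_turn_timeout=3)
print(url)
```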

Context

These three features arrived together but address different layers of the agentic stack. Context Compaction is memory management: long-running agents accumulate state, and compressing that state changes the economics of multi-turn sessions. The problem is not new. It is the same challenge that Anthropic addresses with expanded context windows and prompt caching, but xAI approaches it through compression rather than raw capacity.

WebSocket Responses Mode is infrastructure optimization. Per-request HTTP overhead becomes measurable in high-frequency tool-calling agents, where the model may orchestrate dozens of tool calls per user interaction. A persistent WebSocket connection pays the connection setup cost once per session instead of once per call.
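
A back-of-the-envelope model makes the difference concrete. The function below is a simplified sketch: it assumes every HTTP request opens a fresh connection (in practice HTTP keep-alive can already reuse connections), and the figures of 40 tool calls and 50 ms of setup are illustrative inputs, not measurements of the xAI API.

```python
def total_setup_overhead_ms(tool_calls, setup_ms, persistent):
    """Total connection setup time paid during one session, in milliseconds."""
    if tool_calls <= 0:
        return 0.0
    # A persistent connection is opened once; otherwise once per tool call.
    connections = 1 if persistent else tool_calls
    return connections * setup_ms


# Illustrative inputs only.
per_request = total_setup_overhead_ms(40, 50.0, persistent=False)
websocket = total_setup_overhead_ms(40, 50.0, persistent=True)
print(f"per-request connections: {per_request} ms")
print(f"persistent connection:   {websocket} ms")
```

Under these assumptions the setup cost grows linearly with the number of tool calls in the per-request case and stays constant in the persistent case.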

Smart Turn is a voice-specific quality-of-life feature. Many STT systems use simple silence-duration heuristics to detect end-of-turn, and these fail on common speech patterns such as mid-sentence pauses, number reading, and thinking aloud. An ML-based detector that models linguistic completion is a meaningful improvement for voice agents.
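
The contrast between the two approaches can be sketched as two decision functions. This is not xAI's implementation: `p_complete` stands in for whatever score the real model produces, and every threshold here (700 ms, 200 ms, 0.5, 3000 ms) is a hypothetical value chosen for illustration. The `max_silence_ms` fallback plays the role the article describes for `smart_turn_timeout`.

```python
def fixed_timeout_endpoint(silence_ms, timeout_ms=700):
    """Classic heuristic: end the turn once silence exceeds a fixed timeout."""
    return silence_ms >= timeout_ms


def smart_endpoint(silence_ms, p_complete, threshold=0.5,
                   min_silence_ms=200, max_silence_ms=3000):
    """Model-assisted endpointing with a maximum-silence fallback."""
    if silence_ms < min_silence_ms:
        return False  # too short to count as a silence boundary
    if silence_ms >= max_silence_ms:
        return True   # fallback: end the turn regardless of the model score
    return p_complete >= threshold


# A speaker pauses 800 ms in the middle of reading a phone number, and the
# (hypothetical) model scores the utterance as unlikely to be complete.
print(fixed_timeout_endpoint(800))          # heuristic ends the turn
print(smart_endpoint(800, p_complete=0.1))  # model-assisted version keeps listening
```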

Why it matters

Context Compaction is the standout feature economically. Production agent workloads build up context fast: a customer service agent running for 30 minutes may accumulate tens of thousands of tokens. Because each request sends the accumulated context as input, compressing it into a reusable snapshot before the next API call lowers the input token count of every later request, especially for agents that re-anchor their context frequently.
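
A toy simulation shows why the savings compound. The model below assumes each turn adds a fixed number of tokens and that the full context is sent as input on every request. All the numbers (40 turns, 500 tokens per turn, compaction every 10 turns down to a 1,000-token snapshot) are illustrative, and the cost of the compaction call itself is ignored.

```python
def total_input_tokens(turns, tokens_per_turn, compact_every=None,
                       snapshot_tokens=0):
    """Sum the input tokens sent across a session under a simple growth model."""
    context = 0
    total = 0
    for turn in range(1, turns + 1):
        context += tokens_per_turn
        total += context  # the whole context is sent as input on this request
        if compact_every and turn % compact_every == 0:
            # Replace the accumulated context with a shorter snapshot.
            context = min(context, snapshot_tokens)
    return total


baseline = total_input_tokens(40, 500)
compacted_total = total_input_tokens(40, 500, compact_every=10,
                                     snapshot_tokens=1000)
print(baseline)         # 410000
print(compacted_total)  # 140000
```

Without compaction, total input grows quadratically with the number of turns. With periodic compaction it grows roughly linearly, which is the effect the paragraph above describes.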

WebSocket Responses Mode is a compounding advantage for xAI on latency-sensitive workloads. In agentic coding or customer-facing agents where tool calls are tightly coupled to user-visible actions, shaving connection overhead off each call adds up.

Smart Turn is narrow but meaningful for voice agent builders. The failure mode it fixes, cutting off a speaker mid-sentence, is a common complaint about deployed voice agent products. A detection layer that models speech patterns rather than just silence duration is a differentiating capability.

Corroborating sources

Docs.x
https://docs.x.ai/docs/release-notes
“The Context Compaction API is now available. You can shrink long conversations into a shorter context and reuse it in follow-up requests for lower cost, faster time-to-first-token, and sharper responses on long agent loops.”
