
OpenAI launches three new realtime voice models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper

Published Jun 7, 2026

On June 7, 2026, OpenAI published a detailed overview of three new realtime voice models now available in the API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. The models move realtime voice AI beyond simple call-and-response by adding GPT-5-class reasoning, live multilingual translation, and streaming transcription to the Realtime API.

What's new

GPT-Realtime-2 is OpenAI's first voice model with GPT-5-class reasoning capability. According to OpenAI's developer documentation, "Realtime 2 adds reasoning to speech-to-speech workflows," which lets the model handle complex requests and carry conversations forward rather than producing reactive, pattern-matched responses. It supports an adjustable reasoning.effort parameter (low/medium/high) for managing latency in production voice agents.
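To make the reasoning.effort setting concrete, here is a minimal Python sketch of how a client might assemble a session configuration that carries it. Only the parameter name reasoning.effort and its three values come from the announcement. The model identifier "gpt-realtime-2", the nesting of the field under "session", and the helper name build_session_update are assumptions for illustration; the actual event schema should be checked against the Realtime API guide.

```python
import json

# The three values named in the announcement.
ALLOWED_EFFORTS = ("low", "medium", "high")


def build_session_update(effort: str, model: str = "gpt-realtime-2") -> str:
    """Return a JSON string for a session configuration event.

    ASSUMPTION: the model id and the placement of reasoning.effort inside
    the session object are illustrative, not a confirmed schema.
    """
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(
            f"effort must be one of {ALLOWED_EFFORTS}, got {effort!r}"
        )
    event = {
        "type": "session.update",
        "session": {
            "model": model,                    # hypothetical identifier
            "reasoning": {"effort": effort},   # parameter named in the article
        },
    }
    return json.dumps(event)


if __name__ == "__main__":
    # A latency-sensitive agent might start at the lowest setting.
    print(build_session_update("low"))
```

Validating the value on the client side, as the sketch does, surfaces a misconfigured effort level before a connection is opened rather than as a server-side error mid-call.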

GPT-Realtime-Translate is a live translation model that accepts 70+ input languages and produces 13 output languages. It translates speech while keeping pace with the speaker, enabling speech-to-speech translation in a single model call.

GPT-Realtime-Whisper is a streaming speech-to-text model that transcribes live audio as the speaker talks, with controllable latency for production deployments.

All three models are available through the OpenAI Realtime API.

Context

OpenAI launched the Realtime API in 2024, enabling low-latency audio interactions using GPT-4o's native multimodal audio capabilities. The new 2026 model family represents a generational step: moving from GPT-4o-era voice capability to GPT-5-class reasoning in a realtime speech context.

The addition of native reasoning to a voice model is significant. Prior realtime audio models could respond to spoken queries, but could not reason through complex problems in the way GPT-5-class text models can. GPT-Realtime-2 closes that gap.

The translation model similarly consolidates what previously required a multi-step pipeline (speech-to-text → translation model → text-to-speech) into a single model call, reducing latency and integration complexity.
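The structural difference between the two approaches can be sketched in a few lines of Python. The stage functions below are hypothetical placeholders that only tag their input; they stand in for real services and make no network calls. The sketch counts model calls per utterance and says nothing about actual latency figures.

```python
from typing import Callable, List, Tuple

# Each stage stands in for one model call (one round trip) in a real system.
Stage = Callable[[str], str]


def run_stages(stages: List[Stage], payload: str) -> Tuple[str, int]:
    """Run stages in order; return the final output and the number of calls made."""
    calls = 0
    for stage in stages:
        payload = stage(payload)
        calls += 1
    return payload, calls


# Hypothetical placeholder stages for the cascaded pipeline.
def transcribe(audio: str) -> str:
    return f"text({audio})"


def translate(text: str) -> str:
    return f"translated({text})"


def synthesize(text: str) -> str:
    return f"speech({text})"


# Hypothetical placeholder for a single speech-to-speech translation call.
def translate_speech(audio: str) -> str:
    return f"speech(translated({audio}))"


cascaded_out, cascaded_calls = run_stages([transcribe, translate, synthesize], "audio")
single_out, single_calls = run_stages([translate_speech], "audio")

if __name__ == "__main__":
    print(cascaded_calls, single_calls)
```

In the cascaded form, each stage is a separate integration point with its own failure modes, and each stage waits on the previous stage's output; the single-call form leaves one integration point.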

Why it matters

Voice has historically lagged text in AI capability. These models bring the reasoning depth of frontier text models to spoken interfaces for the first time.

For developers building voice agents — customer service, healthcare communication tools, educational assistants, accessibility products — reasoning-capable voice AI enables use cases that were impractical with earlier models. An agent that can reason through a multi-step problem while the user speaks is qualitatively different from one that pattern-matches to preset responses.

The translation model opens real-time multilingual voice applications at scale, in a market where specialized translation services have historically held an advantage over general-purpose AI.

These models collectively push the practical frontier for production voice AI deployments.

Corroborating sources

Developers.openai
https://developers.openai.com/api/docs/guides/realtime
“Realtime 2 adds reasoning to speech-to-speech workflows.”