QUICK ANSWER
OpenAI launched three new real-time audio models and moved the Realtime API to GA in May 2026. The three models: GPT-Realtime-2 (conversational agents, GPT-5-class reasoning, continuous audio stream), GPT-Realtime-Translate (live translation, 70+ input / 13 output languages), and GPT-Realtime-Whisper (streaming live transcription). All three are available via the OpenAI API. The Realtime API is no longer in beta - production deployments are now officially supported.
Why the Architecture Matters - Continuous Stream vs Pipeline
Most voice AI systems prior to GPT-Realtime-2 used a three-step pipeline: a speech recognition model transcribes audio to text, an LLM processes the text and generates a text response, and a text-to-speech model converts that response back to audio. Each step adds latency, and the LLM in the middle loses information that only existed in the audio - tone, emotional register, speaking pace, interruptions, background noise, multiple speakers.
GPT-Realtime-2 processes audio in a continuous stream as a single model. It hears the audio directly, reasons about it, and generates audio output - without the transcription and synthesis gaps. The result is lower latency (it can begin responding before the speaker has finished), better handling of prosodic information (tone, emphasis, interruptions), and the ability to reason about audio events like laughter, silence, or background noise that would be lost in a transcription pipeline.
This architecture is what makes GPT-Realtime-2 suitable for genuine conversational agents - not just voice-controlled chatbots, but systems that can participate in a natural conversation, interrupt, be interrupted, and maintain context across a back-and-forth exchange that feels like talking to a person rather than dictating commands to a system.
The Three Models - What Each Does and When to Use It
GPT-Realtime-2
What it does: Full conversational voice agent. Processes audio continuously, reasons with GPT-5-class intelligence, responds in audio. Supports interruption handling, emotion detection, multi-speaker conversations.
Use cases: AI phone agents, customer support, real-time coaching, interactive voice response with actual reasoning capability, voice-controlled agentic systems.
API: WebSocket or WebRTC interface via the Realtime API endpoint.
Access: Generally available - no beta suffix required.
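As a rough illustration of the WebSocket path, the sketch below builds the JSON client events a session would typically send. The event types `session.update`, `input_audio_buffer.append`, and `response.create` follow the Realtime API's published event schema. The model identifier `gpt-realtime-2`, the query-string form of the URL, and the session fields are assumptions here and should be checked against the current API reference. The code only constructs messages; it does not open a connection.

```python
import base64
import json

# Assumption: the model identifier and URL shape are illustrative only.
MODEL = "gpt-realtime-2"
REALTIME_URL = f"wss://api.openai.com/v1/realtime?model={MODEL}"


def session_update(instructions: str) -> str:
    """Client event that sets session-level instructions."""
    return json.dumps(
        {"type": "session.update", "session": {"instructions": instructions}}
    )


def audio_append(pcm_chunk: bytes) -> str:
    """Client event carrying one base64-encoded chunk of raw audio."""
    encoded = base64.b64encode(pcm_chunk).decode("ascii")
    return json.dumps({"type": "input_audio_buffer.append", "audio": encoded})


def response_create() -> str:
    """Client event asking the model to start generating a response."""
    return json.dumps({"type": "response.create"})
```

In a real client these strings would be sent over an authenticated WebSocket connection, with server events read back on the same socket.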
GPT-Realtime-Translate
What it does: Live multilingual voice translation. Each person speaks in their preferred language; the model translates and speaks in the target language in real time. Supports 70+ input languages and 13 output languages.
Use cases: Cross-border customer support (Deutsche Telekom confirmed testing), international sales calls, live event translation, multilingual education platforms, product education videos with live translation as they play (Vimeo demonstration).
Key design challenge addressed: Preserving meaning while keeping pace with natural speech, including regional pronunciation and domain-specific language.
GPT-Realtime-Whisper
What it does: Streaming live transcription. Transcribes audio as people speak, producing live captions in real time rather than after recording ends. Built on Whisper's speech recognition technology extended to streaming contexts.
Use cases: Live meeting captions, real-time subtitles, live customer call transcription for CRM logging, accessibility tools, live broadcast captioning.
Difference from original Whisper: Original Whisper was post-recording batch transcription. GPT-Realtime-Whisper produces transcription as speech occurs - making it usable inside live business workflows.
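To make the batch versus streaming distinction concrete, here is a minimal sketch of how a caption display might consume partial results as they arrive. The event shape (`kind` set to `delta` or `final`, plus `text`) is hypothetical and chosen for illustration; it is not the GPT-Realtime-Whisper wire format, which should be taken from the API reference.

```python
from dataclasses import dataclass, field


@dataclass
class LiveCaption:
    """Accumulates partial transcript text into finished caption lines."""

    finished: list[str] = field(default_factory=list)
    partial: str = ""

    def feed(self, event: dict) -> str:
        """Apply one event and return the text currently on screen."""
        # Hypothetical event shape: {"kind": "delta" | "final", "text": str}
        if event["kind"] == "delta":
            # Partial text grows while the speaker is still talking.
            self.partial += event["text"]
        elif event["kind"] == "final":
            # The finalized segment replaces whatever partial text was shown.
            self.finished.append(event["text"])
            self.partial = ""
        return self.partial or (self.finished[-1] if self.finished else "")
```

A batch transcriber would only produce the `finished` list after the recording ends; the streaming case updates the display on every `delta`.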
Pricing and API Access
| Model | Audio Input | Audio Output | Text tokens | Interface |
| --- | --- | --- | --- | --- |
| GPT-Realtime-2 | $40/M tokens | $80/M tokens | Standard GPT-5 rates | WebSocket / WebRTC |
| GPT-Realtime-Translate | $40/M tokens | $80/M tokens | - | WebSocket / WebRTC |
| GPT-Realtime-Whisper | $3/M tokens | - | - | WebSocket streaming |
Audio token pricing is significantly higher than text token pricing because audio requires substantially more compute per token. GPT-Realtime-2 at $40 input / $80 output per million audio tokens is expensive for high-volume deployments - a one-hour customer support call generating continuous audio input and output can cost several dollars. For high-value interactions (enterprise sales calls, medical consultations, legal proceedings) the economics are straightforward. For commodity support calls at scale, the cost per interaction needs careful modelling before production deployment.
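The cost modelling can start from a few lines of arithmetic. The sketch below uses the GPT-Realtime-2 audio prices from the table above. The number of audio tokens generated per minute is not given here, so `tokens_per_minute` is a parameter you must measure or look up, and the value used in the usage note is a placeholder, not a published figure.

```python
# GPT-Realtime-2 audio prices from the pricing table, in USD per million tokens.
INPUT_PRICE_PER_M = 40.0
OUTPUT_PRICE_PER_M = 80.0


def call_cost(minutes_in: float, minutes_out: float, tokens_per_minute: float) -> float:
    """Estimate the audio cost of one call in USD.

    minutes_in / minutes_out: minutes of caller audio and model audio.
    tokens_per_minute: audio tokens per minute of speech (an assumption;
    measure this from real usage data before relying on the result).
    """
    in_tokens = minutes_in * tokens_per_minute
    out_tokens = minutes_out * tokens_per_minute
    return (in_tokens * INPUT_PRICE_PER_M + out_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
```

For example, with a placeholder rate of 1,000 tokens per minute, `call_cost(30, 30, 1000)` gives 3.6, that is $3.60 for an hour split evenly between caller and model. Multiply by expected call volume to see whether the per-interaction cost fits the use case.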
GPT-Realtime-Whisper at $3/M tokens is the most accessible entry point - streaming transcription is significantly cheaper than full conversational AI. For teams whose primary need is live captions or real-time CRM logging, GPT-Realtime-Whisper is the right starting point before considering GPT-Realtime-2.
Frequently Asked Questions
How does GPT-Realtime-2 differ from the original GPT-4o Realtime that launched in 2024?
The original GPT-4o Realtime (2024) was based on GPT-4o-class reasoning. GPT-Realtime-2 is built on GPT-5-class reasoning - a significant capability upgrade. GPT-Realtime-2 also exits the preview stage that GPT-4o Realtime occupied; it is now generally available for production use. The Realtime API itself is also generally available rather than in beta, meaning SLA commitments and production support apply.
Does GPT-Realtime-Translate replace the need for a human interpreter?
For many commercial use cases, yes. Deutsche Telekom is testing it for multilingual customer service interactions. Vimeo demonstrated live product video translation. For high-stakes contexts - legal proceedings, medical diagnoses, diplomatic communications - human interpretation remains appropriate given the consequences of translation errors. GPT-Realtime-Translate is best positioned for high-volume, moderate-stakes multilingual interactions where speed and cost matter more than zero error tolerance.
Is WebRTC or WebSocket better for GPT-Realtime-2?
WebRTC is better for browser-based applications where peer-to-peer audio streaming is needed - it handles codec negotiation, NAT traversal, and jitter buffering automatically. WebSocket is better for server-side applications, mobile apps with custom audio handling, or any case where you need more control over the audio pipeline. OpenAI supports both; choose based on your application architecture.
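One example of the extra control the WebSocket route implies: your code, not the transport, decides how audio is framed before sending. The sketch below splits raw audio into fixed-duration, base64-encoded chunks. It assumes 16-bit mono PCM at 24 kHz, which is an assumption about the input format rather than a confirmed requirement for GPT-Realtime-2, and `chunk_pcm` is a hypothetical helper name.

```python
import base64

SAMPLE_RATE_HZ = 24_000  # assumption: 24 kHz mono input
BYTES_PER_SAMPLE = 2     # 16-bit PCM


def chunk_pcm(pcm: bytes, chunk_ms: int = 100) -> list[str]:
    """Split raw PCM16 audio into base64 strings of chunk_ms each.

    The final chunk may be shorter than chunk_ms.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [
        base64.b64encode(pcm[i:i + chunk_bytes]).decode("ascii")
        for i in range(0, len(pcm), chunk_bytes)
    ]
```

With WebRTC the browser's media stack handles this framing for you, which is why it tends to be the simpler choice on the client side.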