FRI, APRIL 24, 2026
Independent · In‑Depth · Unsponsored

xAI's Grok Voice Think Fast 1.0 Is #1 on Tau Voice Bench — And It Costs $0.05 Per Minute

Released April 24, 2026, grok-voice-think-fast-1.0 is xAI's new flagship voice model — purpose-built for complex, multi-step enterprise workflows. It tops the τ-voice Bench leaderboard for real-world noise and accent handling, powers Starlink's phone line with a 20% sales conversion and 70% autonomous resolution rate, and is available via the xAI API at $0.05 per minute of connection time.

By AIToolsRecap · April 24, 2026

xAI Releases Its Most Capable Voice Model Yet

xAI has released grok-voice-think-fast-1.0, its new flagship voice model and a significant step up from the Grok Voice Agent API that launched in late 2025. The model is purpose-built for complex, ambiguous, multi-step voice workflows — customer support, phone sales, appointment booking, data collection, and enterprise agent applications that involve dozens of tool calls per session.

The headline benchmark claim: grok-voice-think-fast-1.0 ranks #1 on the τ-voice Bench leaderboard — the benchmark that evaluates full-duplex voice agents under realistic conditions including telephony noise, heavy accents, interruptions, and turn-taking. This is the benchmark that matters most for real-world deployment, not synthetic studio audio.

Quick Facts
Model: grok-voice-think-fast-1.0  |  Benchmark: #1 τ-voice Bench  |  API price: $0.05/min  |  Built with: Starlink  |  In production: +1 (888) GO STARLINK  |  Available: xAI API now

What Makes It Different From Other Voice Models

Background Reasoning With No Latency Hit

The model's core architectural differentiator is what xAI calls background reasoning: grok-voice-think-fast-1.0 thinks through challenging queries and edge cases in real time without adding any delay to the audio response. Most voice models face a direct trade-off — more reasoning means slower responses. xAI has decoupled the two. The model can work through a complex multi-step problem while simultaneously generating snappy audio output, which is why it scores well on both accuracy and speed benchmarks.

xAI published a concrete example: when asked which months are spelled with the letter X, a typical voice model answers confidently and incorrectly. grok-voice-think-fast-1.0 reasons through the edge case before responding and catches the mistake. This matters at enterprise scale: in customer support and sales, a confidently wrong answer ends the call and the relationship.

Real-World Noise Handling

The model has been battle-tested on the hardest voice input conditions: telephony audio quality, background ambient noise, heavy accents, and frequent interruptions. xAI built it with Starlink as a design partner specifically because Starlink's customer base is global and calls come in across dozens of languages, accents, and connection qualities. If a model works reliably on Starlink's support line, it works in the real world.

Structured Data Collection

One of the most commercially valuable capabilities is reliable structured data extraction from speech. grok-voice-think-fast-1.0 can collect email addresses, street addresses, phone numbers, full names, and account numbers — even when spoken quickly or with a strong accent. It handles speech disfluencies and natural corrections the way a human agent would: the caller says the wrong thing, corrects themselves mid-sentence, and the model extracts the intended value. It then invokes the appropriate tool with the corrected parameter and reads back the result for confirmation.
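Because the API follows the OpenAI Realtime specification (noted in the pricing section), a data-collection tool can be declared in the familiar function-schema shape. A minimal sketch of what the correction flow looks like on the wire — the tool name, fields, and sample values here are illustrative assumptions, not from xAI documentation:

```python
import json

# Hypothetical tool schema for collecting a caller's email address.
# Name and parameters are illustrative -- not from xAI docs.
collect_email_tool = {
    "type": "function",
    "name": "update_contact_email",
    "description": "Record the caller's email address after read-back confirmation.",
    "parameters": {
        "type": "object",
        "properties": {
            "email": {"type": "string", "description": "Final, corrected email address"},
            "confirmed": {"type": "boolean", "description": "Caller confirmed the read-back"},
        },
        "required": ["email", "confirmed"],
    },
}

# The caller says "john dot smith at gmail -- sorry, at outlook dot com".
# The model is expected to emit a tool call carrying the *corrected* value:
tool_call = {
    "name": "update_contact_email",
    "arguments": json.dumps({"email": "john.smith@outlook.com", "confirmed": True}),
}

print(tool_call["arguments"])
```

The agent then reads the stored value back to the caller before marking `confirmed` true — the read-back step is what turns a noisy transcription into a verified record.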

High-Volume Tool Calling

The Starlink deployment runs 28 distinct tools across hundreds of support and sales workflows in a single agent session. grok-voice-think-fast-1.0 was designed specifically for this — not just occasional tool calls to check a fact, but continuous, parallel tool orchestration across an entire customer interaction lifecycle. Most voice models degrade significantly when tool call frequency increases; this model was trained to maintain accuracy under that load.

Starlink Production Numbers

The most credible performance data in the announcement is from a live production deployment, not a benchmark. Grok Voice powers Starlink's phone line at +1 (888) GO STARLINK — handling customer support and phone sales across numerous languages:

| Metric | Result | What It Means |
|---|---|---|
| Sales conversion rate | 20% | 1 in 5 sales inquiries ends in a purchase on the call |
| Support resolution rate | 70% | Majority resolved with no human agent involved |
| Tools used per session | 28 | Distinct tools across hundreds of support and sales workflows |

These are vendor-reported figures. Treat the conversion and resolution rates as directional rather than independently verified. But the Starlink deployment is real and publicly callable — you can test the model by dialing the number yourself, which is an unusual level of transparency for an AI product launch.

τ-voice Bench: The Benchmark That Matters for Voice Agents

Most voice AI benchmarks test clean studio audio at normal speaking pace. τ-voice Bench (Tau Voice Bench) evaluates full-duplex voice agents under conditions that reflect actual deployment: background noise, non-native accents, mid-sentence interruptions, and realistic turn-taking patterns. grok-voice-think-fast-1.0 takes the top spot on this leaderboard as of the April 24, 2026 release.

For context on the broader xAI voice stack: the Grok Voice Agent API (the predecessor product) ranked #1 on Big Bench Audio — the leading audio reasoning benchmark — and achieved under 1 second time-to-first-audio, nearly 5 times faster than the closest competitor at the time of that launch. grok-voice-think-fast-1.0 builds on that same in-house stack (custom VAD, tokenizer, and audio models built from scratch) and adds the reasoning layer and structured data collection capabilities needed for enterprise-grade deployment.

Pricing and API Access

| xAI Voice API Surface | Price | Use Case |
|---|---|---|
| Voice Agent API (grok-voice-think-fast-1.0) | $0.05 / min | Live realtime conversation, tool calling, multi-turn sessions |
| Speech to Text — Batch | $0.10 / hr | Pre-recorded audio transcription, 25+ languages |
| Speech to Text — Streaming | $0.20 / hr | Real-time transcription via WebSocket API |
| Text to Speech | $4.20 / 1M chars | One-shot speech generation, 5 voices (Ara, Eve, Leo, Rex, Sal), 20 languages |

The $0.05/min voice agent pricing is the competitive number. A 10-minute customer support call costs $0.50 in connection time. OpenAI's Realtime API is estimated at $0.10/min or higher in blended production use — xAI is claiming roughly half the cost. Tool calls (web search, X search, function calls) are billed separately on top of connection time. A session with 20 tool calls at $5/1,000 calls adds approximately $0.10 to a 10-minute session — still well under $1.00 for a complete customer interaction.
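The billing arithmetic above is simple enough to sketch as a two-term formula, using only the rates published in the announcement ($0.05/min connection, $5 per 1,000 tool calls):

```python
# Back-of-envelope session cost using the published rates.
CONNECTION_RATE = 0.05        # USD per minute of connection time
TOOL_CALL_RATE = 5.0 / 1000   # USD per tool call ($5 per 1,000 calls)

def session_cost(minutes: float, tool_calls: int) -> float:
    """Connection time plus tool-call charges for one voice session."""
    return minutes * CONNECTION_RATE + tool_calls * TOOL_CALL_RATE

# The 10-minute, 20-tool-call example from the article:
print(f"${session_cost(10, 20):.2f}")  # $0.60
```

Note this covers connection and tool-call charges only; web search and X search are billed as tool calls, per the announcement.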

The API connects via WebSocket at wss://api.x.ai/v1/realtime and is compatible with the OpenAI Realtime API specification — existing applications built on the OpenAI standard can migrate without a complete rewrite. xAI also provides an official LiveKit plugin for teams already using LiveKit for media transport.
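Since the endpoint speaks the OpenAI Realtime wire format, configuring a session starts with the standard `session.update` event. A minimal sketch — the endpoint URL and model name are from this announcement, but exactly which session fields xAI honors is an assumption here:

```python
import json

# Endpoint from the announcement; the event shape follows the OpenAI
# Realtime API specification, which xAI says it is compatible with.
REALTIME_URL = "wss://api.x.ai/v1/realtime"

# "session.update" is the standard Realtime configuration event.
# Which fields xAI supports beyond these is an assumption, not documented here.
session_update = {
    "type": "session.update",
    "session": {
        "model": "grok-voice-think-fast-1.0",
        "voice": "Ara",  # one of the five published voices
        "instructions": "You are a support agent. Read collected data back for confirmation.",
    },
}

payload = json.dumps(session_update)
print(payload)

# Sending it is one call with any WebSocket client, e.g.:
#   async with websockets.connect(REALTIME_URL) as ws:
#       await ws.send(payload)
```

Teams on LiveKit can skip the raw WebSocket layer entirely and use the official plugin mentioned above.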

The Full xAI Voice Stack

grok-voice-think-fast-1.0 sits at the top of a now-complete xAI voice infrastructure that covers the full developer surface:

Voice Agent API — realtime WebSocket conversation with tool use. This is the product announced today.

Speech to Text API (generally available April 18, 2026) — batch and streaming transcription across 25+ languages with speaker diarization, word-level timestamps, and Inverse Text Normalization. On phone call entity recognition, Grok STT reports 5.0% word error rate versus ElevenLabs at 12.0%, Deepgram at 13.5%, and AssemblyAI at 21.3%.

Text to Speech API — five expressive voices (Ara, Eve, Leo, Rex, Sal) across 20 languages with speech tags including [laugh], [sigh], and <whisper> for fine-grained control over vocal delivery. Priced at $4.20 per million characters.

All three surfaces run on the same stack that powers Grok Voice on iOS and Android, Tesla vehicles, and Starlink customer support — meaning the infrastructure is battle-tested at consumer scale before any developer runs a single API call.
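The speech tags on the TTS surface ride inline in the input text. A hedged sketch of a request body — the voices, tags, and $4.20/1M-character rate are from the announcement, but the field names and model id are assumptions:

```python
import json

# Inline speech tags from the announcement: [laugh], [sigh], <whisper>.
text = "Good news! [laugh] Your order shipped. <whisper>Tracking is in your inbox.</whisper>"

# Field names and model id below are assumptions, not xAI's documented schema.
tts_request = {
    "model": "grok-tts",  # hypothetical model id
    "voice": "Eve",       # one of: Ara, Eve, Leo, Rex, Sal
    "input": text,
}

# At $4.20 per 1M characters, billing scales with input length:
cost = len(text) * 4.20 / 1_000_000
print(json.dumps(tts_request))
print(f"~${cost:.6f}")
```

Even generously tagged prompts stay far below a cent per utterance at that rate.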

Who Should Be Looking at This

Customer support teams running high call volume — the 70% autonomous resolution rate and 28-tool session architecture are the numbers to test against your current deflection rates. At $0.05/min, a 10-minute call costs $0.50 in connection time, versus the $8-15 the same call typically costs with a human agent in most markets.

Phone sales operations — the 20% conversion rate on Starlink's line is a vendor claim, but the Starlink line is publicly callable and independently testable. If your current phone sales conversion is below 20%, the comparison is worth making.

Developers migrating from OpenAI Realtime API — because the API implements the OpenAI Realtime specification, migration is a configuration change, not a rewrite. The cost differential makes evaluation low-risk.

Enterprise teams with multilingual support requirements — native-level proficiency across dozens of languages with automatic language detection and mid-conversation switching. If your support line handles non-English callers today with reduced quality, this addresses the core problem.

Teams already using ElevenLabs, Deepgram, or AssemblyAI for STT — the 5.0% vs 12.0-21.3% entity recognition error rate comparison is the sharpest claim in the announcement. If your use case involves extracting names, account numbers, or addresses from phone calls, the accuracy gap is substantial enough to warrant a direct evaluation.

Tags
Grok · Voice AI · AI agents · AI News · 2026 · Generative AI · Productivity