Quick Answer
xAI's Custom Voices feature went live on May 2, 2026. Record roughly 60 seconds of audio in the xAI console, pass a two-stage voice verification, and your cloned voice is ready in under two minutes — callable via the same TTS and Voice Agent API endpoints as the 80+ preset voices. No extra charge beyond the standard $4.20/million characters TTS rate. ElevenLabs charges $60–$120/million characters for equivalent features. That's the headline.
What Launched and When
xAI shipped Custom Voices on May 2, 2026, bundled with the Grok 4.3 model release. The feature adds voice cloning directly to the xAI API at https://api.x.ai/v1/custom-voices, alongside a new Voice Library in the xAI console — a single page where teams can browse, preview, and manage both custom clones and the 80+ built-in preset voices across 28 languages.
This is the fourth major voice capability xAI has shipped since December 2025. In chronological order: Voice Agent API (December 2025), Voice Mode on X (March 2026), standalone TTS and STT APIs (April 18, 2026), and now Custom Voices (May 2, 2026). Each layer was built on production infrastructure already serving Grok mobile apps, Tesla vehicles, and Starlink customer support.
Comparison: Grok Custom Voices vs ElevenLabs vs OpenAI TTS
| Feature | Grok (xAI) | ElevenLabs | OpenAI TTS |
| --- | --- | --- | --- |
| TTS price (per 1M chars) | $4.20 | $60–$120 | ~$15 |
| Voice Agent API | $0.05/min | $0.08–$0.12/min | Realtime API available |
| Voice cloning fee | Free (no extra charge) | Per-character / per-tier | Not available |
| Min. audio to clone | ~60 seconds | Varies by tier | N/A |
| Languages supported | 28 | 32+ | Multiple |
| Preset voice library | 80+ voices | Large community marketplace | 6 preset voices |
| Emotional speech tags | [laugh], [sigh], <whisper>, <emphasis> | Emotion settings + style controls | Limited |
| Phone call benchmark (WER) | 5.0% error rate | 12.0% error rate | Not published |
How to Clone Your Voice with the Grok API
The process runs entirely through the xAI console at console.x.ai. There is no SDK step for the initial clone creation — you record, verify, and receive a voice_id, then use that ID in any API call exactly as you would use a preset voice.
Step 1: Record the verification phrase
The console prompts you to read a specific phrase aloud. xAI's speech-to-text engine transcribes the recording and matches it against the prompted phrase in real time, confirming you are physically present and consenting. According to xAI, this blocks cloning from a pre-existing audio file: you cannot feed in a recorded clip of someone else's voice and pass the passphrase match.
Step 2: Record ~60 seconds of natural speech
After the passphrase step, record roughly one minute of natural, conversational speech. The pipeline extracts speaker embeddings from both the passphrase recording and the full sample, then compares them to confirm they belong to the same speaker. xAI says the clone is production-ready in under two minutes from this point.
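To make the two checks concrete, here is a minimal sketch of what a passphrase match followed by a speaker-embedding comparison could look like. It is not xAI's implementation: the function names, the `SequenceMatcher` text comparison, the cosine-similarity test, and both thresholds are assumptions chosen for illustration, and real embeddings would come from a speaker-encoder model that is not shown here.

```python
import math
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so transcripts compare on words only."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def passphrase_matches(expected: str, transcript: str, min_ratio: float = 0.9) -> bool:
    """Stage 1 (hypothetical): is the live transcript close enough to the prompt?"""
    ratio = SequenceMatcher(None, normalize(expected), normalize(transcript)).ratio()
    return ratio >= min_ratio


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_speaker(passphrase_emb: list[float], sample_emb: list[float],
                 threshold: float = 0.75) -> bool:
    """Stage 2 (hypothetical): do both recordings appear to share one speaker?"""
    return cosine_similarity(passphrase_emb, sample_emb) >= threshold
```

A clone request would proceed only if both `passphrase_matches(...)` and `same_speaker(...)` return `True`. Production systems typically tune such thresholds against measured false-acceptance rates, which xAI has not published.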
Step 3: Use your voice_id anywhere in the API
Your custom voice gets an 8-character alphanumeric ID. Pass it to the TTS REST endpoint, the WebSocket streaming endpoint, or the Voice Agent realtime API — identical syntax to a preset voice. It inherits all TTS capabilities: inline speech tags ([laugh], [sigh], [breath], <whisper>, <emphasis>), multilingual output across 28 languages, and both REST and WebSocket delivery.
A minimal Python TTS call with a custom voice looks like this:
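import os
from openai import OpenAI  # assumption: the TTS endpoint is reachable through an OpenAI-compatible client

# Assumed setup: the XAI_API_KEY variable name and base URL are illustrative; check xAI's docs for the official SDK.
client = OpenAI(api_key=os.environ["XAI_API_KEY"], base_url="https://api.x.ai/v1")

response = client.audio.speech.create(
    model="grok-tts",
    voice="cx7a3b2d",  # your 8-char custom voice_id
    input="Hello, this is my cloned voice speaking."
)
response.stream_to_file("output.mp3")  # write the returned audio to disk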
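Before sending a request, it can help to sanity-check the voice ID and assemble the tagged input text in one place. The sketch below assumes that custom IDs always match the 8-character alphanumeric shape described above, and that the angle-bracket tags wrap a span of text with a closing tag (`<whisper>...</whisper>`). The closing-tag form and the helper names are assumptions for illustration, not documented xAI behaviour.

```python
import re

# Assumed shape of a custom voice ID: exactly 8 alphanumeric characters.
VOICE_ID_PATTERN = re.compile(r"[A-Za-z0-9]{8}")

# Tags named in the article; how they nest or close is an assumption here.
INLINE_TAGS = {"laugh", "sigh", "breath"}
WRAPPING_TAGS = {"whisper", "emphasis"}


def is_valid_custom_voice_id(voice_id: str) -> bool:
    """Return True if the ID looks like an 8-char alphanumeric custom voice ID."""
    return VOICE_ID_PATTERN.fullmatch(voice_id) is not None


def inline(tag: str) -> str:
    """Build a standalone cue such as [sigh]."""
    if tag not in INLINE_TAGS:
        raise ValueError(f"unknown inline tag: {tag}")
    return f"[{tag}]"


def wrap(tag: str, text: str) -> str:
    """Wrap text in a span tag such as <whisper>...</whisper> (assumed syntax)."""
    if tag not in WRAPPING_TAGS:
        raise ValueError(f"unknown wrapping tag: {tag}")
    return f"<{tag}>{text}</{tag}>"


tts_input = f"{inline('sigh')} Long day. {wrap('whisper', 'But we shipped it.')}"
```

The resulting `tts_input` string would be passed as the `input` argument of the TTS call, with the validated ID as `voice`.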
Grok Custom Voices: Pricing Breakdown
xAI charges nothing extra to create or use a cloned voice. The rates that apply are the standard Grok voice API rates:
- TTS API: $4.20 per million characters (REST or WebSocket streaming)
- Voice Agent API (realtime): $0.05 per minute ($3.00 per hour)
- STT API: $0.10/hour (batch) or $0.20/hour (streaming), across 25 languages
- Custom voice cloning: $0 — no per-clone creation fee
For context: ElevenLabs charges $60–$120 per million characters for its TTS API. At xAI's $4.20 rate, that is a 14–28x price gap on the same workload. ElevenLabs Conversational AI agents run $0.08–$0.12 per minute; Vapi runs $0.05–$0.09; Retell runs $0.10–$0.31. xAI's $0.05/minute flat rate sits at the low end of the market.
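The arithmetic behind the 14–28x figure is easy to reproduce. The sketch below uses only the per-million-character rates quoted in this article; the 2,000,000-character workload is an arbitrary example, and the constant and function names are illustrative.

```python
XAI_TTS_RATE = 4.20                     # USD per 1M characters (article figure)
ELEVENLABS_TTS_RATES = (60.0, 120.0)    # USD per 1M characters, low and high plan
VOICE_AGENT_PER_MINUTE = 0.05           # USD per minute (article figure)


def tts_cost(characters: int, rate_per_million: float) -> float:
    """Cost in USD of synthesising `characters` at a per-million-character rate."""
    return characters / 1_000_000 * rate_per_million


def price_ratio(other_rate: float, base_rate: float) -> float:
    """How many times more expensive `other_rate` is than `base_rate`."""
    return other_rate / base_rate


workload = 2_000_000  # example workload in characters
print(f"xAI: ${tts_cost(workload, XAI_TTS_RATE):.2f}")  # xAI: $8.40
for rate in ELEVENLABS_TTS_RATES:
    # prints 14.3x for the $60 rate and 28.6x for the $120 rate
    print(f"ElevenLabs at ${rate:.0f}/M: ${tts_cost(workload, rate):.2f} "
          f"({price_ratio(rate, XAI_TTS_RATE):.1f}x)")
```

The exact ratios are about 14.3x and 28.6x, which the article's "14–28x" figure rounds down.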
What It's Actually Good For
Customer support agents with a consistent brand voice
Record a brand voice once, deploy it across every support interaction. The cloned voice passes through the same Voice Agent API that already runs on Tesla vehicles and Starlink support — infrastructure that has handled real production load before this feature opened to external developers.
Content at scale — audiobooks, podcasts, video narration
Narrate a 10-hour audiobook in your own voice without re-recording every chapter. The emotional speech tags ([laugh], [sigh], <emphasis>) do some of the work that human studio direction normally handles. For English-language content where naturalness matters, xAI's voice quality is competitive in independent testing, though ElevenLabs still leads on emotional expressiveness depth and non-English language quality.
Accessibility — preserving voices for people who lose them
xAI lists this explicitly in its use case documentation: creating personalised voices for individuals who have lost the ability to speak, preserving their vocal identity. The 60-second minimum recording threshold is a practical constraint here — voice banking while still healthy requires planning ahead.
Multilingual delivery from one voice
A CEO's keynote recorded once in English can be delivered in Spanish, French, German, Chinese, Japanese, and other supported languages using the same voice model. The clone inherits multilingual capability from the base TTS stack without any additional setup.
Honest Limitations
The 60-second floor is higher than rivals'. Alibaba's Qwen3-TTS clones from just 3 seconds of audio. If your use case involves very short audio samples, such as call-centre recordings or archived clips, xAI's 60-second minimum is a real constraint.
No independent security audit published. xAI has not released false-acceptance rates, anti-spoofing measures, or red-team results for its two-stage verification system. The claim that pre-existing recordings cannot be used to clone a voice remains a launch-page assertion, not a peer-reviewed result. Researchers have not independently tested whether synthesised passphrases or replayed audio can defeat the speaker-embedding gate.
ElevenLabs still leads on non-English quality. Independent testing shows ElevenLabs ahead on multilingual voice quality, particularly Spanish, and on overall emotional expressiveness. For English-only production workloads where cost is the primary constraint, xAI is hard to pass up. For maximum naturalness or non-English content, ElevenLabs is still stronger.
Grok 4.3 "narcolepsy" is a separate concern. Early community reports of the underlying Grok 4.3 model entering states of excessive caution in agentic workflows affect Voice Agent use cases more than standard TTS. If you're building autonomous voice agents, monitor this before committing to production.
Decision Framework
Which voice API should you use?
If cost per character is your primary constraint → Grok Custom Voices. The 14–28x price gap over ElevenLabs is real and documented at current API rates.
If you need maximum voice naturalness in Spanish, French, or non-English languages → ElevenLabs still leads on multilingual expressiveness depth.
If you're already using Grok 4.3 for text or agents → Custom Voices adds no integration work: same endpoint surface, same credentials, no separate vendor.
If you need voice cloning from very short samples (<60 seconds) → ElevenLabs or Qwen3-TTS. xAI's 60-second floor is a hard constraint.
If you're building for accessibility / voice banking → Grok Custom Voices works, but requires planning — users must record 60+ seconds while still able to speak clearly.
Workflow Stack: Grok Voice in a Real Production Pipeline
A customer support voice agent built on xAI in May 2026 might look like this:
- Grok 4.3 (text) handles intent classification and response generation — $1.25/million input tokens, $2.50/million output tokens.
- Grok STT transcribes incoming caller audio — $0.20/hour streaming, 25 languages, speaker diarization included.
- Grok Custom Voices TTS delivers the response in a consistent brand voice — $4.20/million characters, emotional speech tags for naturalness.
- Grok Voice Agent API ties the realtime session together — $0.05/minute, WebSocket streaming, OpenAI Realtime API-compatible specification so existing code can point to wss://api.x.ai/v1/realtime with minimal changes.
The entire stack runs through one API key, one billing account, and one endpoint surface. That consolidation reduces vendor management overhead and eliminates the latency cost of routing audio through a separate third-party STT or TTS provider.
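For budgeting, the four line items above can be rolled into a rough per-call estimate. The sketch below treats each component as billed separately, which the list implies but the article does not confirm; the Voice Agent rate may already cover some of the audio processing. The `CallProfile` fields and the sample numbers are hypothetical.

```python
from dataclasses import dataclass

# Rates quoted in this article (USD).
RATES = {
    "voice_agent_per_minute": 0.05,
    "stt_streaming_per_hour": 0.20,
    "tts_per_million_chars": 4.20,
    "llm_input_per_million_tokens": 1.25,
    "llm_output_per_million_tokens": 2.50,
}


@dataclass
class CallProfile:
    """Hypothetical usage profile for one support call."""
    minutes: float
    tts_characters: int
    input_tokens: int
    output_tokens: int


def estimate_call_cost(call: CallProfile) -> dict[str, float]:
    """Break one call's estimated cost into line items plus a total."""
    items = {
        "voice_agent": call.minutes * RATES["voice_agent_per_minute"],
        "stt": call.minutes / 60 * RATES["stt_streaming_per_hour"],
        "tts": call.tts_characters / 1_000_000 * RATES["tts_per_million_chars"],
        "llm": (call.input_tokens / 1_000_000 * RATES["llm_input_per_million_tokens"]
                + call.output_tokens / 1_000_000 * RATES["llm_output_per_million_tokens"]),
    }
    items["total"] = sum(items.values())
    return items


# Example: a 6-minute call with 3,000 spoken characters and a short LLM exchange.
estimate = estimate_call_cost(CallProfile(6, 3_000, 4_000, 800))
for name, usd in estimate.items():
    print(f"{name:>12}: ${usd:.4f}")
```

In this example the per-minute Voice Agent charge dominates the total, so call duration is the variable to watch under these assumptions.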
Frequently Asked Questions
Is Grok voice cloning free?
Creating a custom voice clone costs nothing. Using it is billed at the standard TTS rate of $4.20 per million characters — the same rate as any preset voice. There is no per-clone creation fee and no separate licensing charge for the Custom Voices feature.
Can I clone someone else's voice with the Grok API?
No. xAI's two-stage verification requires the speaker to read a live passphrase aloud in real time. The system matches speaker embeddings from both the passphrase clip and the full recording, then rejects mismatches. You also cannot submit a pre-existing audio file in place of a live recording. Independent researchers have not yet published results testing these claims against synthesised passphrases.
How does Grok TTS pricing compare to ElevenLabs?
Grok TTS costs $4.20 per million characters. ElevenLabs charges $60–$120 per million characters depending on the plan — a 14–28x difference at current rates (as of May 2, 2026). For a full pricing breakdown of Grok Imagine and other Grok API features, see our Grok API pricing guide.
Does the Grok Voice Agent API work with OpenAI Realtime API code?
Yes, with modifications. The Voice Agent API uses the same mental model as OpenAI's Realtime API — stateful sessions, streaming events, tool use, and live audio patterns. Change the WebSocket endpoint to wss://api.x.ai/v1/realtime and update any event names that differ (for example, response.text.delta instead of response.output_text.delta). Most existing Realtime API code requires only those targeted changes.
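One way to keep existing handlers untouched is a thin adapter that renames incoming event types before dispatch. The sketch below reads the example above as xAI emitting `response.text.delta` where existing OpenAI-oriented code expects `response.output_text.delta`. That direction, the single-entry mapping, and the helper name are assumptions to verify against xAI's event reference.

```python
import json

XAI_REALTIME_URL = "wss://api.x.ai/v1/realtime"  # endpoint named in the article

# Assumed mapping: xAI event name -> name the existing handlers expect.
# Extend it with any other differences found in xAI's documentation.
XAI_TO_EXISTING_EVENT = {
    "response.text.delta": "response.output_text.delta",
}


def normalize_event(raw_message: str) -> dict:
    """Parse a WebSocket message and rename its event type if a mapping exists."""
    event = json.loads(raw_message)
    event_type = event.get("type")
    event["type"] = XAI_TO_EXISTING_EVENT.get(event_type, event_type)
    return event


incoming = '{"type": "response.text.delta", "delta": "Hi"}'
print(normalize_event(incoming)["type"])  # response.output_text.delta
```

Keeping the mapping in one dictionary means the rest of the codebase stays vendor-neutral, and any further event-name differences become one-line additions.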
What languages does Grok Custom Voices support?
Custom voices inherit support for all 28 languages in xAI's TTS stack. The built-in voice library covers the same 28 languages, with the xAI console listing English, Spanish, French, German, Chinese, and Japanese among the supported outputs. Custom clones route through the same multilingual model, so language is set at inference time — not at clone-creation time.
How does Grok STT compare to ElevenLabs for phone calls?
In phone-call entity recognition benchmarks, Grok STT records a 5.0% word error rate against ElevenLabs' 12.0% — a meaningful gap for call-centre and voice agent use cases. For general English voice quality, the gap is narrower. ElevenLabs leads on multilingual quality and has a larger community voice marketplace. For English-only, cost-sensitive production workloads, Grok STT is the stronger choice on current benchmark data.
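For readers who want to run the same metric on their own call recordings, word error rate is the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. The sketch below is a generic implementation of that standard definition, not the benchmark harness behind the figures above; the sample sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = ("please send the invoice to forty two baker street london "
             "before friday at noon and copy my manager on it")
hypothesis = reference.replace("baker", "beaker")  # one misheard word out of 20
print(f"{word_error_rate(reference, hypothesis):.1%}")  # 5.0%
```

A 5.0% WER therefore means roughly one word in twenty is wrong, and on phone calls a single wrong word in a name or address can be the one that matters.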