Grok Voice

xAI yesterday announced the beta launch of Voice Agent Builder, a no-code platform for configuring production voice agents on Grok Voice. The tool bundles telephony, knowledge retrieval, tools, guardrails, MCP support and observability into a single interface, aimed at operators and developers who want high-volume voice agents without assembling the stack themselves.

Most voice platforms stitch together separate speech-to-text, language model and text-to-speech services, often from different providers, with each handover adding cost, latency and a fresh point of failure. xAI's alternative runs on a single speech-to-speech system built specifically for Grok Voice, avoiding that relay between components altogether.

xAI says Grok Voice was trained on difficult real-world calls, including poor telephony audio, background noise, heavy accents, interruptions and callers changing their minds mid-sentence, across more than 25 languages. On its own τ-voice Bench, which tests agents under these conditions, xAI reports Grok Voice Think Fast 1.0 scoring 67.3% overall, ahead of Gemini 3.1 Flash Live on 43.8% and GPT Realtime 1.5 on 35.3%.

Setup involves writing a plain-language description of how calls should flow, then attaching documents, tools and guardrails; xAI says this can produce a working agent in around two minutes. Usage is billed at the API rate of $0.05 per minute of audio, with voices included and no separate platform fee, plus $0.01 per minute for telephony on a free provisioned number.
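
To put those rates in context, here is a rough cost sketch in Python. It uses only the two per-minute figures quoted above; the function name `estimate_cost` is hypothetical, and the assumption that minutes are billed linearly, with no rounding rules, minimums or volume discounts, is ours rather than anything xAI has stated.

```python
from decimal import Decimal

# Per-minute rates quoted in the article, in USD.
AUDIO_RATE = Decimal("0.05")      # API rate for audio, voices included
TELEPHONY_RATE = Decimal("0.01")  # telephony on a provisioned number


def estimate_cost(minutes, use_telephony=True):
    """Estimate the USD cost of a given number of call minutes.

    Assumes simple linear billing (an assumption, not confirmed by xAI).
    """
    rate = AUDIO_RATE
    if use_telephony:
        rate += TELEPHONY_RATE
    # Convert via str so float inputs such as 2.5 stay exact.
    return Decimal(str(minutes)) * rate


# 10,000 minutes of phone calls in a month
print(estimate_cost(10_000))  # 600.00
```

On these assumptions, a deployment handling 10,000 minutes of phone calls would cost about $600, of which $100 is telephony.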