A pile of three-letter words gets thrown around in voice-AI architecture. None of them are mysterious — they're just the layers in the audio pipeline.
… <Connect><Stream url="wss://..."/></Connect> to open a media stream to ElevenLabs (or wherever).

… a webhook (/api/elevenlabs/conversation-init) that ElevenLabs hits before each call begins. You return per-call overrides (language, first message, prompt, voice) based on who's calling. This is how Onion does multi-language routing per caller number.

This is what's deployed right now and answers your phone every time someone dials +1 (234) 562-2962. Five actors talk to each other; ElevenLabs is the bus driver.
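To make the conversation-init idea concrete, here is a minimal sketch of the lookup that webhook performs. The NUMBER_MAPPINGS table, its field names, the phone number, and the voice IDs are all hypothetical. The response shape (conversation_initiation_client_data with a conversation_config_override) is an assumption based on ElevenLabs' documented format, so verify it against the current docs before relying on it.

```python
from typing import Any

# Hypothetical per-caller table; a real deployment would read this from a database.
NUMBER_MAPPINGS: dict[str, dict[str, str]] = {
    "+919800000001": {
        "language": "hi",
        "first_message": "Namaste! How can I help you today?",
        "prompt": "You are a booking assistant. Reply in Hindi.",
        "voice_id": "voice_hi_placeholder",  # placeholder, not a real voice ID
    },
}

# Fallback used when the caller's number has no mapping.
DEFAULT_MAPPING: dict[str, str] = {
    "language": "en",
    "first_message": "Hi! How can I help you today?",
    "prompt": "You are a booking assistant. Reply in English.",
    "voice_id": "voice_en_placeholder",  # placeholder, not a real voice ID
}


def build_conversation_init(caller_id: str) -> dict[str, Any]:
    """Return per-call overrides for the given caller number (E.164)."""
    mapping = NUMBER_MAPPINGS.get(caller_id, DEFAULT_MAPPING)
    return {
        # Assumed response envelope; check the ElevenLabs docs for the exact schema.
        "type": "conversation_initiation_client_data",
        "conversation_config_override": {
            "agent": {
                "language": mapping["language"],
                "first_message": mapping["first_message"],
                "prompt": {"prompt": mapping["prompt"]},
            },
            "tts": {"voice_id": mapping["voice_id"]},
        },
    }
```

The web framework is left out on purpose: the handler only needs to read the caller's number from the request body and return this dictionary as JSON.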
Total perceived turn latency in this stack runs ~600–1200 ms from the moment you stop speaking to the moment you hear the agent reply. The trick is that nothing is sequential: every layer streams its output to the next one as soon as the first chunk is ready.
Everything overlaps. By the time the LLM has finished generating its reply, the first half of it has already been spoken to the caller.
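A back-of-the-envelope model shows why streaming matters so much. The per-stage numbers below are hypothetical, picked only to illustrate the arithmetic; they are not measurements from this stack. In a sequential pipeline each stage waits for the previous one to finish, so you pay the full duration of every stage. In a streaming pipeline, time to first audio is roughly the sum of each stage's time to first output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: int  # delay before the stage emits its first chunk
    full_output_ms: int   # delay before the stage has emitted everything


# Hypothetical timings, for illustration only.
STAGES = [
    Stage("end-of-speech detection + final ASR", 200, 300),
    Stage("LLM", 350, 1500),
    Stage("TTS", 150, 900),
    Stage("network + telephony", 100, 100),
]


def sequential_ms(stages: list[Stage]) -> int:
    """Each stage starts only after the previous stage has fully finished."""
    return sum(s.full_output_ms for s in stages)


def streaming_ms(stages: list[Stage]) -> int:
    """Each stage starts as soon as the previous stage emits its first chunk."""
    return sum(s.first_output_ms for s in stages)


if __name__ == "__main__":
    print(f"sequential: {sequential_ms(STAGES)} ms")
    print(f"streaming:  {streaming_ms(STAGES)} ms")
```

With these made-up numbers the sequential total is 2800 ms and the streaming total is 800 ms. The model ignores jitter and buffering, but it captures why the caller hears the start of the reply long before the LLM has finished writing it.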
Twilio is dominant globally but expensive and sometimes regulatory-heavy in India and the Middle East. Exotel (India), Plivo, Sinch (MENA), and Tata Communications (enterprise India) all do the same job: rent a number, connect calls. Architecturally they slot in cleanly.
When we ran POST /api/twilio/link-agent, ElevenLabs took ownership of the Twilio number's voice URL behind the scenes. They also accept SIP trunk registration, but they don't have a one-click "import from Exotel" button. Two ways around this:
| | Twilio | Exotel |
|---|---|---|
| US local number cost | $1.15/mo | n/a (US not their market) |
| India local number cost | ~$5/mo + KYC bundle hassle | ~₹500/mo, India-native KYC |
| Onboarding for India | Regulatory bundle paperwork (weeks) | Same-day |
| Latency to Indian callers | 120–250ms (US edge) | 30–80ms (Mumbai edge) |
| ElevenLabs native integration | Built-in | SIP trunk or DIY bridge |
| API quality | Excellent docs, Twilio CLI | OK docs, no CLI |
| DLT registration (India outbound) | Pre-built compliance | You handle it (templates available) |
| Best for | US, UK, EU | India inbound, lower cost, lower latency |
Critically: ElevenLabs, OpenAI, and your Onion code are all unchanged. The conversation-init webhook fires identically. Your number_mappings still work — same caller_id format, same response format. The whole right-hand side of the architecture diagram doesn't move.
This is the value of designing telephony as an adapter: you switch the provider, not the product.
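A minimal sketch of that adapter idea: each provider gets a small class that turns its inbound-call webhook into one shared InboundCall shape, so everything downstream sees the same caller_id format. CallSid, From, and To are Twilio's standard voice webhook parameters. The Exotel field names (CallFrom, CallTo) and the leading-zero number format are assumptions to confirm against Exotel's docs, and InboundCall, TelephonyAdapter, and to_e164_india are hypothetical names.

```python
from dataclasses import dataclass
from typing import Mapping, Protocol


@dataclass(frozen=True)
class InboundCall:
    provider: str
    call_id: str
    caller_id: str       # always E.164, e.g. "+919876543210"
    called_number: str   # always E.164


class TelephonyAdapter(Protocol):
    def parse_inbound(self, params: Mapping[str, str]) -> InboundCall: ...


def to_e164_india(number: str) -> str:
    """Normalize an Indian number such as '09876543210' to E.164."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if number.strip().startswith("+"):
        return "+" + digits            # already E.164
    if digits.startswith("0"):
        digits = digits[1:]            # drop the domestic trunk prefix
    if len(digits) == 10:
        return "+91" + digits          # bare national number
    return "+" + digits                # assume country code is already present


class TwilioAdapter:
    def parse_inbound(self, params: Mapping[str, str]) -> InboundCall:
        # Twilio already sends E.164 numbers.
        return InboundCall("twilio", params["CallSid"], params["From"], params["To"])


class ExotelAdapter:
    def parse_inbound(self, params: Mapping[str, str]) -> InboundCall:
        # Assumed field names; verify against Exotel's webhook documentation.
        return InboundCall(
            "exotel",
            params["CallSid"],
            to_e164_india(params["CallFrom"]),
            to_e164_india(params["CallTo"]),
        )
```

Because both adapters return the same InboundCall, the number-mapping lookup and everything after it never need to know which provider carried the call.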
Option A: Stay on ElevenLabs and train a voice clone in the target language. Upload 30 minutes of clean Malayalam recordings from a native speaker; ElevenLabs trains a Professional Voice Clone and gives you a voice_id. The TTS engine is still ElevenLabs' multilingual model (eleven_v3_conversational), but with that person's voice and accent. Drop the voice_id into your number mapping. Done.
Option B: Bypass ElevenLabs CAI, but only for calls in the problem language, and build a DIY pipeline for those. This is what Vapi and Retell built. You replace the entire conversation orchestrator for those specific calls: Sarvam for ASR + TTS, OpenAI for the LLM, and your own server for state, turn-taking, and tool calls. ElevenLabs is out of the loop for those calls.
Option A is one afternoon. Option B is multiple weeks of infrastructure. Pick A unless quality is genuinely a blocker.
If you must use Sarvam for actual TTS+ASR, the architecture splits into two paths. Calls in English and other non-problematic languages keep going through ElevenLabs CAI. Calls flagged for Sarvam treatment route to your DIY pipeline.
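The split itself can be a very small routing decision made before the call is connected. The sketch below is one way to express it; the Pipeline enum, the SARVAM_LANGUAGES set, and the use of "ml" (Malayalam) as the flagged language are illustrative assumptions, not the deployed configuration.

```python
from enum import Enum


class Pipeline(Enum):
    ELEVENLABS_CAI = "elevenlabs_cai"
    DIY_SARVAM = "diy_sarvam"


# Hypothetical: languages that still sound wrong on the default stack.
SARVAM_LANGUAGES = frozenset({"ml"})


def choose_pipeline(language: str) -> Pipeline:
    """Pick a pipeline from a language tag such as 'en', 'ml', or 'ml-IN'."""
    base = language.strip().lower().split("-")[0]  # 'ml-IN' -> 'ml'
    if base in SARVAM_LANGUAGES:
        return Pipeline.DIY_SARVAM
    return Pipeline.ELEVENLABS_CAI
```

Keeping the flagged set as data means a language can move between paths without touching the routing code.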
Notice that "DIY pipeline" is doing four jobs ElevenLabs used to do for you in one box: turn-taking, VAD, ASR, and TTS. Plus you have to wire your tool-calling system back in (Onion's create_booking webhook etc.) since ElevenLabs no longer manages it.
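To give a feel for those four jobs, here is a deliberately simplified skeleton of one conversational turn. Every engine is passed in as a plain function, so nothing here is a real Sarvam or OpenAI call: is_speech, transcribe, generate_reply, and synthesize are hypothetical stand-ins you would back with actual clients. Barge-in handling and tool calls are left out.

```python
from typing import Callable, Iterable, Iterator


def collect_utterance(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    end_silence_frames: int = 3,
) -> list[bytes]:
    """VAD + turn-taking: gather speech frames until enough trailing silence."""
    utterance: list[bytes] = []
    silence = 0
    for frame in frames:
        if is_speech(frame):
            utterance.append(frame)
            silence = 0
        elif utterance:  # silence only counts once the caller has started talking
            silence += 1
            if silence >= end_silence_frames:
                break
    return utterance


def handle_turn(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], Iterable[str]],
    synthesize: Callable[[str], Iterable[bytes]],
) -> Iterator[bytes]:
    """Run one caller turn and yield reply audio chunks as they become ready."""
    utterance = collect_utterance(frames, is_speech)
    if not utterance:
        return
    text = transcribe(b"".join(utterance))   # ASR
    for sentence in generate_reply(text):    # LLM, streamed sentence by sentence
        yield from synthesize(sentence)      # TTS, streamed chunk by chunk
```

Because handle_turn is a generator, audio for the first sentence can go out to the caller while the LLM is still producing the second, which preserves the overlap that keeps latency low.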
| Engine | What they offer | License |
|---|---|---|
| Sarvam AI | Best-in-class Indian-language TTS + ASR. 11 Indian languages. Production API. | Commercial — pay per use |
| AI4Bharat (IIT Madras) | Open-source ASR (IndicConformer) + TTS (IndicTTS). 22 Indian languages. | Free, MIT (self-host) |
| Bhashini (Govt of India) | State-funded translation + TTS + ASR for Indian languages. API access via signup. | Free with limits |
| ElevenLabs Voice Clone (your existing tool) | Same TTS engine, with a custom voice — works in 32 languages including Malayalam, Hindi, Tamil, Arabic. | Tier-based |
Stay on Option A (voice cloning) for the next 6 months. It's a 20-minute experiment per language, not a 6-week build. If — after professional cloning — Malayalam still sounds wrong, the model itself is the limit, and that's when Option B becomes worth the cost. By then you'll also have paying customers in those languages, which is the only good reason to take on the complexity.