Onion · Architecture Notes

How a phone call actually works: end-to-end across Twilio, ElevenLabs, OpenAI, and the Onion conversation-init webhook — and how to swap providers without breaking the rest.

Glossary: acronyms unpacked

A pile of three-letter words gets thrown around in voice-AI architecture. None of them are mysterious — they're just the layers in the audio pipeline.

ASR — Automatic Speech Recognition
Converts a stream of audio (the caller's voice) into text. Examples: ElevenLabs Scribe, OpenAI Whisper, Google Speech-to-Text, Sarvam ASR. The faster the ASR detects that the user has stopped talking, the lower the call latency.
TTS — Text-to-Speech
The opposite direction: takes text and synthesises spoken audio. ElevenLabs' core product is industry-best TTS. Sarvam and Bhashini are India-focused TTS engines.
LLM — Large Language Model
The "brain" deciding what to say next. GPT-4o, Claude, Llama. ElevenLabs Conversational AI wraps an LLM and feeds it the running transcript turn by turn.
VAD — Voice Activity Detection
A small model that decides "is the caller speaking right now or just paused?" Used by ASR to know when to commit a transcript and trigger the LLM. Tuning VAD timeouts is critical for low-latency conversation.
TTFT / TTFB — Time to First Token / Byte
How fast a streaming model starts emitting output. ElevenLabs Flash hits ~75ms TTFB on TTS. OpenAI streaming hits ~50–300ms TTFT on the LLM. Stack them and you can keep total turn latency under a second.
PSTN — Public Switched Telephone Network
The traditional global phone system. When someone dials your Twilio number from their mobile, the call traverses PSTN until it hits Twilio's edge.
SIP — Session Initiation Protocol
The signalling protocol that lets internet-side services (Twilio, Exotel, ElevenLabs) accept and route phone calls. ElevenLabs imports your Twilio number behind the scenes by registering a SIP endpoint.
TwiML — Twilio Markup Language
An XML-based instruction format. When Twilio receives an inbound call, it asks "what should I do?" and your webhook returns TwiML like <Connect><Stream url="wss://..."/></Connect> to open a media stream to ElevenLabs (or wherever).
WSS — WebSocket Secure
A persistent, bidirectional, low-latency channel between Twilio and the AI provider. Audio chunks flow both ways for the entire duration of the call. No HTTP request/response per turn — that would be way too slow.
WebRTC
A peer-to-peer browser-friendly real-time audio/video standard. ElevenLabs uses it for the in-browser voice widget. Different from PSTN/SIP — that's the "phone" path; WebRTC is the "browser" path.
Conversation Initiation Webhook
An HTTP endpoint you host (Onion has one at /api/elevenlabs/conversation-init) that ElevenLabs hits before each call begins. You return per-call overrides (language, first message, prompt, voice) based on who's calling. This is how Onion does multi-language routing per caller number.
Bundled vs Component voice provider
Bundled = one vendor handles ASR + LLM-orchestration + TTS + telephony bridge (ElevenLabs CAI, Vapi, Retell). Component = you assemble individual pieces yourself (Sarvam TTS + Whisper ASR + your own conversation engine + Twilio Streams). Bundled is fast to ship; component gives you control.

Section 1 — Current architecture: Twilio + ElevenLabs + OpenAI + Onion

This is what's deployed right now and answers your phone every time someone dials +1 (234) 562-2962. Five actors talk to each other; ElevenLabs is the bus driver.

Static architecture — who owns what

Components in play
  • 📞 Caller: any phone, any country.
  • Twilio: telephony provider; owns the number; handles PSTN ↔ internet.
  • ElevenLabs CAI: conversation orchestrator; ASR (Scribe), TTS (eleven_v3_conversational), turn-taking + state; native Twilio integration.
  • OpenAI: the brain; GPT-4o streaming completions.
  • Onion: per-tenant logic; conversation-init webhook; number-mappings DB.
  • MongoDB: stores number_mappings.

The flows between them: (1) PSTN dial, (2) WSS audio, (3) streaming LLM, (4) POST caller_id, (5) lookup, (6) overrides.

What happens during one call (timeline)

Sequence of events from ring to first agent word
  1. Caller dials +1 234 562 2962.
  2. Twilio receives the inbound call (SIP / TwiML).
  3. Twilio's webhook returns TwiML: Connect to Stream at a wss:// URL.
  4. WSS audio chunks start flowing — audio is now bidirectional for the whole call.
  5. ElevenLabs POSTs /api/elevenlabs/conversation-init with { caller_id, agent_id, called_number }.
  6. Onion looks up the mapping (Mongo, ~30ms).
  7. Onion returns conversation_config_override — language: ar, first_message, prompt.
  8. ElevenLabs runs TTS on the first_message (Arabic, v3).
  9. Audio chunks of the greeting flow back through Twilio.
  10. The caller hears the greeting.

Then the per-turn loop: the caller speaks (audio chunks, 50ms each); ASR streams words and VAD detects end-of-utterance; ElevenLabs POSTs /chat/completions (streaming); tokens arrive — 50ms TTFT, then a flood; TTS synthesises audio per sentence; the caller hears the agent. When the caller hangs up, the stream closes and post-call analysis runs.
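The conversation-init exchange in the timeline above amounts to a lookup plus a small JSON response. A sketch of that lookup, with the mapping shape and override field names modelled on these notes — the exact ElevenLabs payload schema may differ, so treat this as illustrative:

```typescript
// Sketch of the conversation-init lookup. NumberMapping and the override
// field names are assumptions based on the notes, not a documented schema.
interface NumberMapping {
  callerPrefix: string;   // e.g. "+971", matched against caller_id
  language: string;       // e.g. "ar"
  firstMessage: string;
  prompt: string;
}

export function buildInitResponse(
  callerId: string,
  mappings: NumberMapping[],
) {
  const match = mappings.find((m) => callerId.startsWith(m.callerPrefix));
  if (!match) return {}; // no override: the agent falls back to its defaults
  return {
    conversation_config_override: {
      agent: {
        language: match.language,
        first_message: match.firstMessage,
        prompt: { prompt: match.prompt },
      },
    },
  };
}
```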

Why it feels real-time

Total perceived turn latency in this stack runs ~600–1200 ms from the moment you stop speaking to the moment you hear the agent reply. The trick is that nothing is sequential — every layer streams:

  • ASR emits partial transcripts while you speak, not after you finish.
  • VAD commits the final transcript ~200–400ms after you pause and immediately fires the LLM.
  • OpenAI streams tokens (50–300ms TTFT) — ElevenLabs doesn't wait for the full reply.
  • TTS Flash starts synthesising as soon as the first sentence boundary appears in the LLM stream (75ms TTFB).
  • Connections are persistent WebSockets — no TCP handshake per turn.

Everything overlaps. By the time the LLM has finished generating its reply, the first half of it has already been spoken to the caller.
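The sentence-boundary trick in the TTS bullet above can be sketched as a small accumulator over the token stream: tokens pile up until a sentence-ending character appears, then that sentence is flushed to TTS while the LLM keeps generating. This is purely illustrative — real systems also handle abbreviations, digits, and interruptions.

```typescript
// Sketch of sentence-boundary chunking over a streaming LLM reply.
// Each yielded string is a sentence ready to hand to TTS.
export function* sentences(tokens: Iterable<string>): Generator<string> {
  let buf = "";
  for (const tok of tokens) {
    buf += tok;
    // Flush on ., ?, ! — a naive boundary test for illustration only.
    const m = buf.match(/^(.*?[.?!])\s*(.*)$/s);
    if (m) {
      yield m[1];
      buf = m[2];
    }
  }
  if (buf.trim()) yield buf.trim(); // trailing fragment when the stream ends
}
```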

Section 2 — Swap Twilio for Exotel: India and regional carriers

Twilio is dominant globally but expensive and sometimes regulatory-heavy in India and the Middle East. Exotel (India), Plivo, Sinch (MENA), and Tata Communications (enterprise India) all do the same job: rent a number, connect calls. Architecturally they slot in cleanly.

What changes when Twilio becomes Exotel

Same five actors, different telephony provider
  • 📞 Caller: India / SE Asia.
  • Exotel (NEW): telephony for India; owns the +91 number; India PSTN edge; DLT-registered.
  • Onion SIP bridge (NEW): glue layer; receives the Exotel passthru, opens a WS to ElevenLabs, bridges audio.
  • ElevenLabs CAI: unchanged; same Scribe / v3, same agents.
  • OpenAI: unchanged; GPT-4o.
  • Onion webhook: unchanged; conversation-init, same number_mappings.

Flows: +91 dial → HTTP passthru OR SIP trunk → WSS audio → streaming LLM, with conversation-init as before. Only the NEW components are swapped in; everything else is unchanged from the Twilio architecture — the right-hand side of the diagram doesn't move when you swap the carrier.

The catch — ElevenLabs' phone number import only natively supports Twilio

When we ran POST /api/twilio/link-agent, ElevenLabs took ownership of the Twilio number's voice URL behind the scenes. They also accept SIP trunk registration, but they don't have a one-click "import from Exotel" button. Two ways around this:

  1. SIP trunking (preferred): configure Exotel to forward inbound calls via SIP to ElevenLabs' SIP endpoint. ElevenLabs then thinks of it as just another inbound SIP call. Audio path stays clean: Exotel → SIP → ElevenLabs → OpenAI. No bridge code needed.
  2. WebSocket bridge: Exotel's Voicebot Passthru feature streams audio over their WS protocol to your URL. You'd run a small Node service that translates Exotel's WS frames into ElevenLabs' WS frames. ~300 lines, but introduces a hop you'd rather not have.
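The core of option 2's bridge is a per-frame translation. Both frame shapes below are assumptions for illustration — check Exotel's Voicebot Passthru documentation and ElevenLabs' WS protocol for the real field names before building this.

```typescript
// Sketch of option 2's frame translation. ExotelMediaFrame and
// ElevenLabsAudioFrame are ASSUMED shapes, not either vendor's real schema.
interface ExotelMediaFrame {
  event: "media" | string;
  stream_sid: string;
  media: { payload: string }; // base64 audio, per our assumption
}

interface ElevenLabsAudioFrame {
  user_audio_chunk: string;   // base64 audio, per our assumption
}

export function translateFrame(
  frame: ExotelMediaFrame,
): ElevenLabsAudioFrame | null {
  if (frame.event !== "media") return null; // ignore marks, stops, etc.
  // The audio passes through untouched; only the JSON envelope changes.
  return { user_audio_chunk: frame.media.payload };
}
```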

Practical comparison

                                   Twilio                               Exotel
US local number cost               $1.15/mo                             n/a (US is not their market)
India local number cost            ~$5/mo + KYC bundle hassle           ~₹500/mo, India-native KYC
Onboarding for India               Regulatory bundle paperwork (weeks)  Same-day
Latency to Indian callers          120–250ms (US edge)                  30–80ms (Mumbai edge)
ElevenLabs native integration      Built-in                             SIP trunk or DIY bridge
API quality                        Excellent docs, Twilio CLI           OK docs, no CLI
DLT registration (India outbound)  Pre-built compliance                 You handle it (templates available)
Best for                           US, UK, EU                           India inbound, lower cost, lower latency

What stays the same

Critically: ElevenLabs, OpenAI, and your Onion code are all unchanged. The conversation-init webhook fires identically. Your number_mappings still work — same caller_id format, same response format. The whole right-hand side of the architecture diagram doesn't move.

This is the value of designing telephony as an adapter: you switch the provider, not the product.
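The adapter idea above could look like a small interface that both a Twilio and an Exotel implementation satisfy, so the product code never names a carrier. All names here are hypothetical — a sketch of the pattern, not Onion's actual code.

```typescript
// Sketch of the telephony-adapter pattern: the product talks to this
// interface; each carrier gets its own implementation. Names are invented.
export interface TelephonyAdapter {
  /** Answer an inbound call by pointing its audio at a media-stream URL. */
  answerWithStream(callId: string, wssUrl: string): Promise<void>;
  /** Normalise the carrier's caller-ID format to E.164 (+<country><number>). */
  normalizeCallerId(raw: string): string;
}

// A trivial adapter showing the shape — not real Exotel behaviour.
export class FakeExotelAdapter implements TelephonyAdapter {
  async answerWithStream(_callId: string, _wssUrl: string): Promise<void> {
    // A real implementation would call the carrier's answer/stream API here.
  }
  normalizeCallerId(raw: string): string {
    const digits = raw.replace(/\D/g, "");
    // Assume Indian 10-digit numbers when no country code is present.
    return digits.length === 10 ? `+91${digits}` : `+${digits}`;
  }
}
```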

Section 3 — Sarvam for voice, ElevenLabs for everything else: the real story of "swap just the voice"

Honest correction first. ElevenLabs Conversational AI does not support a "custom TTS endpoint" — TTS is the core IP they're selling. They do support custom LLM endpoints, but not custom TTS or ASR. So "use Sarvam for voice while staying on ElevenLabs CAI" isn't possible as a clean drop-in. This section walks through what's actually achievable.

Two real options when ElevenLabs voice quality isn't good enough for a language

Option A — Stay on ElevenLabs, train a voice clone in the target language. Upload 30 minutes of clean Malayalam recordings from a native speaker; ElevenLabs trains a Professional Voice Clone and you get a voice_id. The TTS engine is still ElevenLabs' multilingual model (eleven_v3_conversational), but with that person's voice and accent. Drop the voice_id into your number mapping. Done.

Option B — Leave ElevenLabs CAI only for calls in the problem language; build a DIY pipeline for those. This is what Vapi/Retell built. You replace the entire conversation orchestrator for those specific calls — Sarvam for ASR + TTS, OpenAI for LLM, your own server for state + turn-taking + tool calls. ElevenLabs is bypassed.

Option A is one afternoon. Option B is multiple weeks of infrastructure. Pick A unless quality is genuinely a blocker.

Option A in pictures

Option A — voice cloning, no architecture change
Step 1 — one-time setup (training the voice):
  1. 🎙️ Record ~30 minutes of clean Malayalam audio from a native speaker.
  2. Upload it to ElevenLabs; the Professional Voice Clone trains in minutes.
  3. ElevenLabs outputs a voice_id (pro_1abc…).
  4. Paste it into the Onion number_mapping (+971… → voiceId override), stored in MongoDB.

Step 2 — runtime (every Malayalam call uses the cloned voice; architecture unchanged):
  A 📞 +971 Malayalam caller dials in. Twilio, ElevenLabs CAI, and OpenAI are all unchanged; the Onion webhook returns the voiceId override at call start via conversation-init, so ElevenLabs orchestrates exactly as before but speaks with the cloned voice for this caller.
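The runtime half of Option A comes down to one extra field in the conversation-init response. A sketch, with the mapping shape and override field names modelled on these notes — verify against the real ElevenLabs schema before relying on them:

```typescript
// Sketch of Option A's runtime piece: the conversation-init response gains
// a voice override when the number mapping carries a cloned voice_id.
// VoiceMapping and the override field names are assumptions.
interface VoiceMapping {
  calledNumber: string; // the line the caller dialled, e.g. a +971 number
  voiceId?: string;     // cloned voice id from the Professional Voice Clone
}

export function voiceOverride(
  calledNumber: string,
  mappings: VoiceMapping[],
): Record<string, unknown> {
  const m = mappings.find((x) => x.calledNumber === calledNumber);
  if (!m?.voiceId) return {}; // no clone configured: default agent voice
  return {
    conversation_config_override: {
      tts: { voice_id: m.voiceId },
    },
  };
}
```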

Option B in pictures

If you must use Sarvam for actual TTS+ASR, the architecture splits into two paths. English / non-problematic-language calls keep going through ElevenLabs CAI. Calls flagged for Sarvam-language treatment route to your DIY pipeline:

Option B — split routing, DIY pipeline for Sarvam-language calls
  • 📞 Caller dials in; Twilio is unchanged.
  • TwiML router (NEW): chooses a branch by the caller's language.
  • English & v3-supported languages → ElevenLabs CAI (the unchanged path).
  • Malayalam / Tamil / Telugu → Onion DIY pipeline (NEW): an audio-bridge WS server you build, Sarvam ASR + TTS for Indian languages, OpenAI as the same brain, and conversation state (turn-taking, VAD, interrupts, tools).

The NEW components are new work. The top branch (ElevenLabs) keeps doing its job for English; the bottom branch is what you build for Sarvam-language calls.
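The router's branch decision can be sketched as a pure function. The language sets here are illustrative — the real lists depend on which languages the ElevenLabs voices actually handle well for you:

```typescript
// Sketch of the split-routing decision made by the TwiML router.
// DIY_LANGUAGES is an illustrative set, not a fixed recommendation.
const DIY_LANGUAGES = new Set(["ml", "ta", "te"]); // Malayalam, Tamil, Telugu

export type Route = "elevenlabs" | "diy";

export function routeCall(callerLanguage: string): Route {
  return DIY_LANGUAGES.has(callerLanguage) ? "diy" : "elevenlabs";
}
```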

Notice that "DIY pipeline" is doing four jobs ElevenLabs used to do for you in one box: turn-taking, VAD, ASR, and TTS. Plus you have to wire your tool-calling system back in (Onion's create_booking webhook etc.) since ElevenLabs no longer manages it.
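To give a feel for one of those four jobs: end-of-utterance detection is essentially a hangover timer over per-frame speech flags. The flags would come from a real VAD model; the timings below are illustrative, in line with the ~200–400ms commit window mentioned earlier.

```typescript
// Sketch of end-of-utterance detection: commit the caller's turn once they
// have been silent for `hangoverMs`. Thresholds are illustrative only.
export class EndOfUtterance {
  private silentMs = 0;
  private speaking = false;
  constructor(private hangoverMs = 300, private frameMs = 50) {}

  /** Feed one audio frame's VAD verdict; returns true when the turn ends. */
  push(isSpeech: boolean): boolean {
    if (isSpeech) {
      this.speaking = true;
      this.silentMs = 0;
      return false;
    }
    if (!this.speaking) return false; // nothing said yet, just dead air
    this.silentMs += this.frameMs;
    if (this.silentMs >= this.hangoverMs) {
      this.speaking = false; // commit the turn and reset for the next one
      this.silentMs = 0;
      return true;
    }
    return false;
  }
}
```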

Sarvam, AI4Bharat, Bhashini — what's actually available

Sarvam AI
  Best-in-class Indian-language TTS + ASR. 11 Indian languages. Production API. License: commercial, pay per use.
AI4Bharat (IIT Madras)
  Open-source ASR (IndicConformer) + TTS (IndicTTS). 22 Indian languages. License: free, MIT, self-host.
Bhashini (Govt of India)
  State-funded translation + TTS + ASR for Indian languages. API access via signup. License: free with limits.
ElevenLabs Voice Clone (your existing tool)
  Same TTS engine, with a custom voice — works in 32 languages including Malayalam, Hindi, Tamil, Arabic. License: tier-based.

Recommendation

Stay on Option A (voice cloning) for the next 6 months. It's a 20-minute experiment per language, not a 6-week build. If — after professional cloning — Malayalam still sounds wrong, the model itself is the limit, and that's when Option B becomes worth the cost. By then you'll also have paying customers in those languages, which is the only good reason to take on the complexity.