Onion · Architecture Notes

How a phone call actually works: end-to-end across Twilio, ElevenLabs, OpenAI, and the Onion conversation-init webhook — and how to swap providers without breaking the rest.

Glossary: acronyms unpacked

A pile of three-letter words gets thrown around in voice-AI architecture. None of them are mysterious — they're just the layers in the audio pipeline.

ASR — Automatic Speech Recognition
Converts a stream of audio (the caller's voice) into text. Examples: ElevenLabs Scribe, OpenAI Whisper, Google Speech-to-Text, Sarvam ASR. The faster the ASR detects that the user has stopped talking, the lower the call latency.
TTS — Text-to-Speech
The opposite direction: takes text and synthesises spoken audio. ElevenLabs' core product is industry-best TTS. Sarvam and Bhashini are India-focused TTS engines.
LLM — Large Language Model
The "brain" deciding what to say next. GPT-4o, Claude, Llama. ElevenLabs Conversational AI wraps an LLM and feeds it the running transcript turn by turn.
VAD — Voice Activity Detection
A small model that decides "is the caller speaking right now or just paused?" Used by ASR to know when to commit a transcript and trigger the LLM. Tuning VAD timeouts is critical for low-latency conversation.
TTFT / TTFB — Time to First Token / Byte
How fast a streaming model starts emitting output. ElevenLabs Flash hits ~75ms TTFB on TTS. OpenAI streaming hits ~50–300ms TTFT on the LLM. Stack them and you can keep total turn latency under a second.
PSTN — Public Switched Telephone Network
The traditional global phone system. When someone dials your Twilio number from their mobile, the call traverses PSTN until it hits Twilio's edge.
SIP — Session Initiation Protocol
The signalling protocol that lets internet-side services (Twilio, Exotel, ElevenLabs) accept and route phone calls. ElevenLabs imports your Twilio number behind the scenes by registering a SIP endpoint.
TwiML — Twilio Markup Language
An XML-based instruction format. When Twilio receives an inbound call, it asks "what should I do?" and your webhook returns TwiML like <Connect><Stream url="wss://..."/></Connect> to open a media stream to ElevenLabs (or wherever).
WSS — WebSocket Secure
A persistent, bidirectional, low-latency channel between Twilio and the AI provider. Audio chunks flow both ways for the entire duration of the call. No HTTP request/response per turn — that would be way too slow.
WebRTC
A peer-to-peer browser-friendly real-time audio/video standard. ElevenLabs uses it for the in-browser voice widget. Different from PSTN/SIP — that's the "phone" path; WebRTC is the "browser" path.
Conversation Initiation Webhook
An HTTP endpoint you host (Onion has one at /api/elevenlabs/conversation-init) that ElevenLabs hits before each call begins. You return per-call overrides (language, first message, prompt, voice) based on who's calling. This is how Onion does multi-language routing per caller number.
Bundled vs Component voice provider
Bundled = one vendor handles ASR + LLM-orchestration + TTS + telephony bridge (ElevenLabs CAI, Vapi, Retell). Component = you assemble individual pieces yourself (Sarvam TTS + Whisper ASR + your own conversation engine + Twilio Streams). Bundled is fast to ship; component gives you control.

Section 1 — Current architecture: Twilio + ElevenLabs + OpenAI + Onion

This is what's deployed right now and answers your phone every time someone dials +1 (234) 562-2962. Five actors talk to each other; ElevenLabs is the bus driver.

Static architecture — who owns what

Components in play
  • 📞 Caller: any phone, any country.
  • Twilio: telephony provider; owns the number; handles PSTN ↔ internet.
  • ElevenLabs CAI: conversation orchestrator; ASR (Scribe), TTS (eleven_v3_conversational), turn-taking + state; native Twilio integration.
  • OpenAI: the brain; GPT-4o streaming completions.
  • Onion: per-tenant logic; conversation-init webhook; number-mappings DB.
  • MongoDB: stores number_mappings.

The flows between them: (1) PSTN dial, (2) WSS audio, (3) streaming LLM, (4) POST caller_id, (5) lookup, (6) overrides.

What happens during one call (timeline)

Sequence of events from ring to first agent word
  1. Caller dials +1 234 562 2962.
  2. Twilio receives the inbound call (SIP / TwiML).
  3. Twilio's webhook returns TwiML: Connect to Stream at a wss:// URL.
  4. WSS audio chunks start flowing — audio is now bidirectional for the whole call.
  5. ElevenLabs POSTs /api/elevenlabs/conversation-init with { caller_id, agent_id, called_number }.
  6. Onion looks up the mapping (Mongo, ~30ms).
  7. Onion returns conversation_config_override — language: ar, first_message, prompt.
  8. ElevenLabs runs TTS on the first_message (Arabic, v3).
  9. Audio chunks of the greeting flow back through Twilio.
  10. The caller hears the greeting.

Then the per-turn loop: the caller speaks (audio chunks, 50ms each); ASR streams words and VAD detects end-of-utterance; ElevenLabs POSTs /chat/completions (streaming); tokens arrive — 50ms TTFT, then a flood; TTS synthesises audio per sentence; the caller hears the agent. When the caller hangs up, the stream closes and post-call analysis runs.
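The conversation-init exchange in the timeline above amounts to a lookup plus a small JSON response. A sketch of that lookup, with the mapping shape and override field names modelled on these notes — the exact ElevenLabs payload schema may differ, so treat this as illustrative:

```typescript
// Sketch of the conversation-init lookup. NumberMapping and the override
// field names are assumptions based on the notes, not a documented schema.
interface NumberMapping {
  callerPrefix: string;   // e.g. "+971", matched against caller_id
  language: string;       // e.g. "ar"
  firstMessage: string;
  prompt: string;
}

export function buildInitResponse(
  callerId: string,
  mappings: NumberMapping[],
) {
  const match = mappings.find((m) => callerId.startsWith(m.callerPrefix));
  if (!match) return {}; // no override: the agent falls back to its defaults
  return {
    conversation_config_override: {
      agent: {
        language: match.language,
        first_message: match.firstMessage,
        prompt: { prompt: match.prompt },
      },
    },
  };
}
```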

Why it feels real-time

Total perceived turn latency in this stack runs ~600–1200 ms from the moment you stop speaking to the moment you hear the agent reply. The trick is that nothing is sequential — every layer streams:

  • ASR emits partial transcripts while you speak, not after you finish.
  • VAD commits the final transcript ~200–400ms after you pause and immediately fires the LLM.
  • OpenAI streams tokens (50–300ms TTFT) — ElevenLabs doesn't wait for the full reply.
  • TTS Flash starts synthesising as soon as the first sentence boundary appears in the LLM stream (75ms TTFB).
  • Connections are persistent WebSockets — no TCP handshake per turn.

Everything overlaps. By the time the LLM has finished generating its reply, the first half of it has already been spoken to the caller.
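The sentence-boundary trick in the TTS bullet above can be sketched as a small accumulator over the token stream: tokens pile up until a sentence-ending character appears, then that sentence is flushed to TTS while the LLM keeps generating. This is purely illustrative — real systems also handle abbreviations, digits, and interruptions.

```typescript
// Sketch of sentence-boundary chunking over a streaming LLM reply.
// Each yielded string is a sentence ready to hand to TTS.
export function* sentences(tokens: Iterable<string>): Generator<string> {
  let buf = "";
  for (const tok of tokens) {
    buf += tok;
    // Flush on ., ?, ! — a naive boundary test for illustration only.
    const m = buf.match(/^(.*?[.?!])\s*(.*)$/s);
    if (m) {
      yield m[1];
      buf = m[2];
    }
  }
  if (buf.trim()) yield buf.trim(); // trailing fragment when the stream ends
}
```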

Section 2 — Swap Twilio for Exotel: India and regional carriers

Twilio is dominant globally but expensive and sometimes regulatory-heavy in India and the Middle East. Exotel (India), Plivo, Sinch (MENA), and Tata Communications (enterprise India) all do the same job: rent a number, connect calls. Architecturally they slot in cleanly.

What changes when Twilio becomes Exotel

Same five actors, different telephony provider
  • 📞 Caller: India / SE Asia.
  • Exotel (NEW): telephony for India; owns the +91 number; India PSTN edge; DLT-registered.
  • Onion SIP bridge (NEW): glue layer; receives the Exotel passthru, opens a WS to ElevenLabs, bridges audio.
  • ElevenLabs CAI: unchanged; same Scribe / v3, same agents.
  • OpenAI: unchanged; GPT-4o.
  • Onion webhook: unchanged; conversation-init, same number_mappings.

Flows: +91 dial → HTTP passthru OR SIP trunk → WSS audio → streaming LLM, with conversation-init as before. Only the NEW components are swapped in; everything else is unchanged from the Twilio architecture — the right-hand side of the diagram doesn't move when you swap the carrier.

The catch — ElevenLabs' phone number import only natively supports Twilio

When we ran POST /api/twilio/link-agent, ElevenLabs took ownership of the Twilio number's voice URL behind the scenes. They also accept SIP trunk registration, but they don't have a one-click "import from Exotel" button. Two ways around this:

  1. SIP trunking (preferred): configure Exotel to forward inbound calls via SIP to ElevenLabs' SIP endpoint. ElevenLabs then thinks of it as just another inbound SIP call. Audio path stays clean: Exotel → SIP → ElevenLabs → OpenAI. No bridge code needed.
  2. WebSocket bridge: Exotel's Voicebot Passthru feature streams audio over their WS protocol to your URL. You'd run a small Node service that translates Exotel's WS frames into ElevenLabs' WS frames. ~300 lines, but introduces a hop you'd rather not have.
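The core of option 2's bridge is a per-frame translation. Both frame shapes below are assumptions for illustration — check Exotel's Voicebot Passthru documentation and ElevenLabs' WS protocol for the real field names before building this.

```typescript
// Sketch of option 2's frame translation. ExotelMediaFrame and
// ElevenLabsAudioFrame are ASSUMED shapes, not either vendor's real schema.
interface ExotelMediaFrame {
  event: "media" | string;
  stream_sid: string;
  media: { payload: string }; // base64 audio, per our assumption
}

interface ElevenLabsAudioFrame {
  user_audio_chunk: string;   // base64 audio, per our assumption
}

export function translateFrame(
  frame: ExotelMediaFrame,
): ElevenLabsAudioFrame | null {
  if (frame.event !== "media") return null; // ignore marks, stops, etc.
  // The audio passes through untouched; only the JSON envelope changes.
  return { user_audio_chunk: frame.media.payload };
}
```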

Practical comparison

                                   Twilio                               Exotel
US local number cost               $1.15/mo                             n/a (US is not their market)
India local number cost            ~$5/mo + KYC bundle hassle           ~₹500/mo, India-native KYC
Onboarding for India               Regulatory bundle paperwork (weeks)  Same-day
Latency to Indian callers          120–250ms (US edge)                  30–80ms (Mumbai edge)
ElevenLabs native integration      Built-in                             SIP trunk or DIY bridge
API quality                        Excellent docs, Twilio CLI           OK docs, no CLI
DLT registration (India outbound)  Pre-built compliance                 You handle it (templates available)
Best for                           US, UK, EU                           India inbound, lower cost, lower latency

What stays the same

Critically: ElevenLabs, OpenAI, and your Onion code are all unchanged. The conversation-init webhook fires identically. Your number_mappings still work — same caller_id format, same response format. The whole right-hand side of the architecture diagram doesn't move.

This is the value of designing telephony as an adapter: you switch the provider, not the product.
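The adapter idea above could look like a small interface that both a Twilio and an Exotel implementation satisfy, so the product code never names a carrier. All names here are hypothetical — a sketch of the pattern, not Onion's actual code.

```typescript
// Sketch of the telephony-adapter pattern: the product talks to this
// interface; each carrier gets its own implementation. Names are invented.
export interface TelephonyAdapter {
  /** Answer an inbound call by pointing its audio at a media-stream URL. */
  answerWithStream(callId: string, wssUrl: string): Promise<void>;
  /** Normalise the carrier's caller-ID format to E.164 (+<country><number>). */
  normalizeCallerId(raw: string): string;
}

// A trivial adapter showing the shape — not real Exotel behaviour.
export class FakeExotelAdapter implements TelephonyAdapter {
  async answerWithStream(_callId: string, _wssUrl: string): Promise<void> {
    // A real implementation would call the carrier's answer/stream API here.
  }
  normalizeCallerId(raw: string): string {
    const digits = raw.replace(/\D/g, "");
    // Assume Indian 10-digit numbers when no country code is present.
    return digits.length === 10 ? `+91${digits}` : `+${digits}`;
  }
}
```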

Section 3 — Sarvam for voice, ElevenLabs for everything else: the real story of "swap just the voice"

Honest correction first. ElevenLabs Conversational AI does not support a "custom TTS endpoint" — TTS is the core IP they're selling. They do support custom LLM endpoints, but not custom TTS or ASR. So "use Sarvam for voice while staying on ElevenLabs CAI" isn't possible as a clean drop-in. This section walks through what's actually achievable.

Two real options when ElevenLabs voice quality isn't good enough for a language

Option A — Stay on ElevenLabs, train a voice clone in the target language. Upload 30 minutes of clean Malayalam recordings from a native speaker; ElevenLabs trains a Professional Voice Clone and you get a voice_id. The TTS engine is still ElevenLabs' multilingual model (eleven_v3_conversational), but with that person's voice and accent. Drop the voice_id into your number mapping. Done.

Option B — Leave ElevenLabs CAI only for calls in the problem language; build a DIY pipeline for those. This is what Vapi/Retell built. You replace the entire conversation orchestrator for those specific calls — Sarvam for ASR + TTS, OpenAI for LLM, your own server for state + turn-taking + tool calls. ElevenLabs is bypassed.

Option A is one afternoon. Option B is multiple weeks of infrastructure. Pick A unless quality is genuinely a blocker.

Option A in pictures

Option A — voice cloning, no architecture change
Step 1 — one-time setup (training the voice):
  1. 🎙️ Record ~30 minutes of clean Malayalam audio from a native speaker.
  2. Upload it to ElevenLabs; the Professional Voice Clone trains in minutes.
  3. ElevenLabs outputs a voice_id (pro_1abc…).
  4. Paste it into the Onion number_mapping (+971… → voiceId override), stored in MongoDB.

Step 2 — runtime (every Malayalam call uses the cloned voice; architecture unchanged):
  A 📞 +971 Malayalam caller dials in. Twilio, ElevenLabs CAI, and OpenAI are all unchanged; the Onion webhook returns the voiceId override at call start via conversation-init, so ElevenLabs orchestrates exactly as before but speaks with the cloned voice for this caller.
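The runtime half of Option A comes down to one extra field in the conversation-init response. A sketch, with the mapping shape and override field names modelled on these notes — verify against the real ElevenLabs schema before relying on them:

```typescript
// Sketch of Option A's runtime piece: the conversation-init response gains
// a voice override when the number mapping carries a cloned voice_id.
// VoiceMapping and the override field names are assumptions.
interface VoiceMapping {
  calledNumber: string; // the line the caller dialled, e.g. a +971 number
  voiceId?: string;     // cloned voice id from the Professional Voice Clone
}

export function voiceOverride(
  calledNumber: string,
  mappings: VoiceMapping[],
): Record<string, unknown> {
  const m = mappings.find((x) => x.calledNumber === calledNumber);
  if (!m?.voiceId) return {}; // no clone configured: default agent voice
  return {
    conversation_config_override: {
      tts: { voice_id: m.voiceId },
    },
  };
}
```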

Option B in pictures

If you must use Sarvam for actual TTS+ASR, the architecture splits into two paths. English / non-problematic-language calls keep going through ElevenLabs CAI. Calls flagged for Sarvam-language treatment route to your DIY pipeline:

Option B — split routing, DIY pipeline for Sarvam-language calls
  • 📞 Caller dials in; Twilio is unchanged.
  • TwiML router (NEW): chooses a branch by the caller's language.
  • English & v3-supported languages → ElevenLabs CAI (the unchanged path).
  • Malayalam / Tamil / Telugu → Onion DIY pipeline (NEW): an audio-bridge WS server you build, Sarvam ASR + TTS for Indian languages, OpenAI as the same brain, and conversation state (turn-taking, VAD, interrupts, tools).

The NEW components are new work. The top branch (ElevenLabs) keeps doing its job for English; the bottom branch is what you build for Sarvam-language calls.
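The router's branch decision can be sketched as a pure function. The language sets here are illustrative — the real lists depend on which languages the ElevenLabs voices actually handle well for you:

```typescript
// Sketch of the split-routing decision made by the TwiML router.
// DIY_LANGUAGES is an illustrative set, not a fixed recommendation.
const DIY_LANGUAGES = new Set(["ml", "ta", "te"]); // Malayalam, Tamil, Telugu

export type Route = "elevenlabs" | "diy";

export function routeCall(callerLanguage: string): Route {
  return DIY_LANGUAGES.has(callerLanguage) ? "diy" : "elevenlabs";
}
```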

Notice that "DIY pipeline" is doing four jobs ElevenLabs used to do for you in one box: turn-taking, VAD, ASR, and TTS. Plus you have to wire your tool-calling system back in (Onion's create_booking webhook etc.) since ElevenLabs no longer manages it.
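To give a feel for one of those four jobs: end-of-utterance detection is essentially a hangover timer over per-frame speech flags. The flags would come from a real VAD model; the timings below are illustrative, in line with the ~200–400ms commit window mentioned earlier.

```typescript
// Sketch of end-of-utterance detection: commit the caller's turn once they
// have been silent for `hangoverMs`. Thresholds are illustrative only.
export class EndOfUtterance {
  private silentMs = 0;
  private speaking = false;
  constructor(private hangoverMs = 300, private frameMs = 50) {}

  /** Feed one audio frame's VAD verdict; returns true when the turn ends. */
  push(isSpeech: boolean): boolean {
    if (isSpeech) {
      this.speaking = true;
      this.silentMs = 0;
      return false;
    }
    if (!this.speaking) return false; // nothing said yet, just dead air
    this.silentMs += this.frameMs;
    if (this.silentMs >= this.hangoverMs) {
      this.speaking = false; // commit the turn and reset for the next one
      this.silentMs = 0;
      return true;
    }
    return false;
  }
}
```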

Sarvam, AI4Bharat, Bhashini — what's actually available

Sarvam AI
  Best-in-class Indian-language TTS + ASR. 11 Indian languages. Production API. License: commercial, pay per use.
AI4Bharat (IIT Madras)
  Open-source ASR (IndicConformer) + TTS (IndicTTS). 22 Indian languages. License: free, MIT, self-host.
Bhashini (Govt of India)
  State-funded translation + TTS + ASR for Indian languages. API access via signup. License: free with limits.
ElevenLabs Voice Clone (your existing tool)
  Same TTS engine, with a custom voice — works in 32 languages including Malayalam, Hindi, Tamil, Arabic. License: tier-based.

Recommendation

Stay on Option A (voice cloning) for the next 6 months. It's a 20-minute experiment per language, not a 6-week build. If — after professional cloning — Malayalam still sounds wrong, the model itself is the limit, and that's when Option B becomes worth the cost. By then you'll also have paying customers in those languages, which is the only good reason to take on the complexity.