Twilio Voice Integration - Technical Requirements Document

This document describes the architecture options for integrating OmniVoice with Twilio for production voice calls.

Overview

Twilio provides multiple approaches for building voice AI applications. Each has different tradeoffs for latency, control, and complexity.

Architecture Options

Option A: TwiML-Based TTS via UpdateCall

How it works:

┌─────────┐         ┌─────────────┐         ┌───────────────┐
│  User   │◄───────►│   Twilio    │◄───────►│  OmniVoice    │
│ (Phone) │  PSTN   │   Voice     │  REST   │               │
└─────────┘         └─────────────┘  API    └───────────────┘

  1. Call is initiated with TwiML containing <Say> or <Gather>
  2. For each AI response, call UpdateCall API with new TwiML
  3. Twilio plays audio, then returns control for next turn
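
For illustration, a minimal sketch of one turn under this model, using the official twilio-go SDK; the TwiML string and the /turn action path are illustrative, not prescribed:

package voice

import (
	"fmt"

	"github.com/twilio/twilio-go"
	api "github.com/twilio/twilio-go/rest/api/v2010"
)

// speakTurn replaces the live call's TwiML with the next AI response.
// Each turn costs one REST round-trip -- the ~200-500ms overhead noted below.
func speakTurn(client *twilio.RestClient, callSid, answerText string) error {
	twiml := fmt.Sprintf(
		`<Response><Say>%s</Say><Gather input="speech" action="/turn"/></Response>`,
		answerText)

	params := &api.UpdateCallParams{}
	params.SetTwiml(twiml)

	_, err := client.Api.UpdateCall(callSid, params)
	return err
}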

Pros:
  • Simple implementation
  • Uses Twilio's built-in TTS voices (Polly, Google)
  • No audio format conversion needed

Cons:
  • High latency: each turn requires a REST API call (~200-500ms overhead)
  • Not full-duplex: turn-based, the user cannot interrupt
  • Poor UX: feels robotic, not conversational

Best for: Simple IVR menus, non-conversational flows


Option B: Media Streams with External TTS/STT

How it works:

┌─────────┐         ┌─────────────┐         ┌───────────────────────────────┐
│  User   │◄───────►│   Twilio    │◄───────►│         OmniVoice             │
│ (Phone) │  PSTN   │   Media     │WebSocket│                               │
└─────────┘         │   Streams   │ (μ-law) │  ┌─────┐  ┌─────┐  ┌─────┐   │
                    └─────────────┘         │  │ STT │  │ LLM │  │ TTS │   │
                                            │  └──┬──┘  └──┬──┘  └──┬──┘   │
                                            │     └────────┴────────┘      │
                                            └───────────────────────────────┘

  1. Call connects to Media Streams via WebSocket
  2. Raw audio (mu-law, 8kHz) flows bidirectionally
  3. Audio from user → STT provider (Deepgram, Whisper)
  4. Transcript → LLM for response
  5. Response → TTS provider (ElevenLabs) → audio back to call
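
A minimal sketch of the server side of this flow, assuming gorilla/websocket. The message shapes ("start", "media", "stop" events carrying base64 mu-law payloads) follow Twilio's documented Media Streams protocol; the STT hand-off is left as a comment:

package voice

import (
	"encoding/base64"
	"encoding/json"
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{}

// mediaMsg covers the subset of Media Streams events this sketch handles.
type mediaMsg struct {
	Event     string `json:"event"`
	StreamSid string `json:"streamSid,omitempty"`
	Media     *struct {
		Payload string `json:"payload"` // base64-encoded mu-law at 8kHz
	} `json:"media,omitempty"`
}

// HandleStream terminates the Media Streams WebSocket for one call.
func HandleStream(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	defer conn.Close()

	var streamSid string
	for {
		_, raw, err := conn.ReadMessage()
		if err != nil {
			return
		}
		var msg mediaMsg
		if err := json.Unmarshal(raw, &msg); err != nil {
			continue
		}
		switch msg.Event {
		case "start":
			streamSid = msg.StreamSid
		case "media":
			ulaw, err := base64.StdEncoding.DecodeString(msg.Media.Payload)
			if err == nil {
				_ = ulaw // decode mu-law -> PCM, resample to 16kHz, feed STT
			}
		case "stop":
			log.Printf("stream %s ended", streamSid)
			return
		}
	}
}

// SendAudio pushes TTS audio back to the caller as a "media" event.
func SendAudio(conn *websocket.Conn, streamSid string, ulaw []byte) error {
	return conn.WriteJSON(map[string]any{
		"event":     "media",
		"streamSid": streamSid,
		"media":     map[string]string{"payload": base64.StdEncoding.EncodeToString(ulaw)},
	})
}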

Pros:
  • True full-duplex: natural conversation flow
  • Low latency: streaming audio, no per-turn API call overhead
  • Interruption support: the user can interrupt the AI mid-sentence
  • Voice quality: use premium TTS (ElevenLabs, Cartesia)
  • Production-grade: this is how Vapi, Retell, and Bland.ai work

Cons:
  • More complex implementation
  • Requires audio format conversion (mu-law ↔ PCM/MP3)
  • Needs external TTS/STT providers

Best for: Production voice AI, conversational agents, customer service


Option C: Twilio ConversationRelay

How it works:

┌─────────┐         ┌─────────────────────┐         ┌───────────────┐
│  User   │◄───────►│ Twilio Conversation │◄───────►│  OmniVoice    │
│ (Phone) │  PSTN   │       Relay         │WebSocket│   (Text)      │
└─────────┘         │  ┌─────┐  ┌─────┐   │  JSON   └───────────────┘
                    │  │ STT │  │ TTS │   │
                    │  └─────┘  └─────┘   │
                    └─────────────────────┘

  1. Twilio handles STT and TTS internally
  2. Agent receives text transcriptions via WebSocket
  3. Agent sends text responses back
  4. Twilio converts to speech and plays to user
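
A sketch of the text-only agent loop, again assuming gorilla/websocket. The message shapes here ("prompt" frames in, "text" tokens out with a "last" flag) reflect our reading of the ConversationRelay protocol and should be verified against Twilio's current docs; respond is a hypothetical hook into the agent:

package voice

import "github.com/gorilla/websocket"

// RelayLoop exchanges text with Twilio ConversationRelay: Twilio does the
// STT/TTS, we only see transcribed prompts and send back reply text.
func RelayLoop(conn *websocket.Conn, respond func(string) string) error {
	for {
		var msg struct {
			Type        string `json:"type"`
			VoicePrompt string `json:"voicePrompt"`
		}
		if err := conn.ReadJSON(&msg); err != nil {
			return err
		}
		if msg.Type != "prompt" {
			continue // this sketch ignores setup/interrupt/dtmf frames
		}
		reply := respond(msg.VoicePrompt)
		// "last": true marks the end of the reply so Twilio can speak it.
		out := map[string]any{"type": "text", "token": reply, "last": true}
		if err := conn.WriteJSON(out); err != nil {
			return err
		}
	}
}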

Pros:
  • Simpler implementation (text-only interface)
  • Twilio handles the audio complexity
  • Built-in voice activity detection

Cons:
  • Less control: limited TTS voice options
  • Higher latency: additional Twilio processing
  • Limited customization: can't use custom TTS/STT

Best for: Quick prototypes, simple use cases


Option D: ElevenLabs Conversational AI (Managed Platform)

How it works:

┌─────────┐         ┌─────────────┐         ┌───────────────────────────────┐
│  User   │◄───────►│   Twilio    │◄───────►│  ElevenLabs Conversational AI │
│ (Phone) │  PSTN   │   (Phone)   │  TwiML  │                               │
└─────────┘         └─────────────┘         │  ┌─────┐  ┌─────┐  ┌─────┐   │
                                            │  │ STT │  │ LLM │  │ TTS │   │
                                            │  └──┬──┘  └──┬──┘  └──┬──┘   │
                                            │     │   Managed │     │      │
                                            │     └────────┴───────┘       │
                                            │            │                 │
                                            │            ▼                 │
                                            │  ┌─────────────────────┐     │
                                            │  │ Custom LLM Endpoint │     │
                                            │  │ (Your ADK Agent)    │     │
                                            │  └─────────────────────┘     │
                                            └───────────────────────────────┘

  1. Twilio call connects to ElevenLabs via TwiML
  2. ElevenLabs handles STT, TTS, and conversation orchestration
  3. LLM can be built-in (GPT-4, Claude) or custom endpoint (your agent)
  4. Tool calling via webhooks or MCP servers

Custom LLM Support:
  • Point to any OpenAI Chat Completions-compatible endpoint
  • Use your own ADK agents (Claude + tools) as the brain
  • ElevenLabs handles all voice complexity
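
As a sketch of what "custom endpoint" means here: a minimal, non-streaming Chat Completions-compatible handler. runAgent is a hypothetical hook into your ADK agent, and the real integration likely needs SSE streaming (stream=true) as well:

package voice

import (
	"encoding/json"
	"net/http"
	"time"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// runAgent is a hypothetical hook: route the transcript into your ADK agent.
func runAgent(msgs []chatMessage) string { return "..." }

// ChatCompletions implements just enough of the OpenAI Chat Completions
// shape for ElevenLabs ConvAI to treat our agent as the LLM.
func ChatCompletions(w http.ResponseWriter, r *http.Request) {
	var req struct {
		Model    string        `json:"model"`
		Messages []chatMessage `json:"messages"`
	}
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	resp := map[string]any{
		"id":      "chatcmpl-omnivoice",
		"object":  "chat.completion",
		"created": time.Now().Unix(),
		"model":   req.Model,
		"choices": []map[string]any{{
			"index":         0,
			"message":       chatMessage{Role: "assistant", Content: runAgent(req.Messages)},
			"finish_reason": "stop",
		}},
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(resp)
}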

Pros:
  • Fastest deployment: hours instead of weeks
  • Premium voice quality: ElevenLabs TTS built-in
  • Low latency: optimized voice pipeline (~300ms)
  • Custom LLM support: use your own agents via an endpoint
  • Tool calling: webhook and MCP server support

Cons:
  • Less control: the platform manages conversation flow
  • Single-agent model: one agent per call (no multi-agent orchestration)
  • Platform dependency: tied to ElevenLabs infrastructure
  • Cost: platform fee on top of API costs

Best for: Single-agent voice apps, rapid prototyping, when voice quality is priority

Go SDK: github.com/agentplexus/go-elevenlabs provides full ConvAI support, including:
  • client.Twilio().RegisterCall() - register incoming calls
  • client.Twilio().OutboundCall() - make outbound calls
  • Custom LLM configuration via agent settings


Comparison Matrix

Criteria          Option A (TwiML)   Option B (Media Streams)   Option C (ConversationRelay)   Option D (ElevenLabs ConvAI)
Latency           High (500ms+)      Low (100-200ms)            Medium (200-400ms)             Low (~300ms)
Full-duplex       No                 Yes                        Partial                        Yes
Interruption      None               Excellent                  Good                           Excellent
Voice quality     Twilio voices      Any TTS provider           Twilio voices                  ElevenLabs (excellent)
Implementation    Simple             Complex                    Medium                         Simple
Control           Full               Full                       Medium                         Medium (custom LLM)
Multi-agent       Yes                Yes                        Yes                            Limited
Cost              Low                Medium                     Low                            Higher (platform fee)
Production-ready  No                 Yes                        Limited                        Yes

Decision: Option B (Primary) with Option D (Alternative)

We choose Option B: Media Streams with External TTS/STT as the primary approach, with Option D: ElevenLabs ConvAI as a supported alternative.

Why Option B (Primary)

  1. Production quality - This is how successful voice AI companies (Vapi, Retell, Bland.ai) implement their systems
  2. User experience - True full-duplex enables natural conversations
  3. Flexibility - Can use best-in-class TTS (ElevenLabs) and STT (Deepgram)
  4. Latency - Critical for voice; streaming minimizes delays
  5. Multi-agent support - Full control over agent orchestration (critical for systems like stats-agent-team)
  6. Future-proof - Architecture supports advanced features (voice cloning, emotion detection)

Why Option D (Alternative)

  1. Rapid deployment - Get voice working in hours, not weeks
  2. Premium voice - ElevenLabs TTS quality without implementation effort
  3. Custom LLM support - Can still use your own Claude/ADK agents via custom endpoint
  4. Single-agent use cases - Perfect when you don't need multi-agent orchestration

When to Use Which

Use Case                                   Recommended Option
Multi-agent systems (stats-agent-team)     Option B
Complex conversation flows                 Option B
Single-agent voice assistant               Option D
Rapid prototyping                          Option D
Maximum voice quality with minimal effort  Option D
Full control over audio pipeline           Option B

Implementation Requirements

Audio Pipeline

┌──────────────────────────────────────────────────────────────────────────┐
│                        Audio Pipeline (Option B)                         │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  INBOUND (User → AI)                                                     │
│  ┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌─────────┐  │
│  │ Twilio      │───►│ μ-law 8kHz   │───►│ PCM 16kHz   │───►│   STT   │  │
│  │ Media Stream│    │ (WebSocket)  │    │ (resample)  │    │(Deepgram│  │
│  └─────────────┘    └──────────────┘    └─────────────┘    └────┬────┘  │
│                                                                  │       │
│                                                                  ▼       │
│                                                           ┌───────────┐  │
│                                                           │ Transcript│  │
│                                                           └─────┬─────┘  │
│                                                                 │        │
│  OUTBOUND (AI → User)                                           ▼        │
│  ┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌─────────┐  │
│  │ Twilio      │◄───│ μ-law 8kHz   │◄───│ PCM/MP3     │◄───│   TTS   │  │
│  │ Media Stream│    │ (WebSocket)  │    │ (convert)   │    │(Eleven- │  │
│  └─────────────┘    └──────────────┘    └─────────────┘    │  Labs)  │  │
│                                                            └────┬────┘  │
│                                                                 │        │
│                                                                 ▲        │
│                                                           ┌─────┴─────┐  │
│                                                           │    LLM    │  │
│                                                           │ (Claude)  │  │
│                                                           └───────────┘  │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘

Components Needed

  1. Audio Format Converter
     • mu-law (8kHz, 8-bit) ↔ PCM (16kHz, 16-bit)
     • PCM ↔ MP3/WAV for TTS providers
Note: mu-law (G.711 μ-law) is an ITU-T standard, not Twilio-specific. The codec implementation lives in omnivoice/audio/codec for reuse across telephony providers:

Provider/System              Audio Format
Twilio Media Streams         mu-law 8kHz
RingCentral                  mu-law/A-law
Most SIP systems             G.711 (mu-law/A-law)
PSTN (North America, Japan)  mu-law
PSTN (Europe)                A-law
FreeSWITCH                   G.711

omnivoice/
└── audio/
    ├── codec/
    │   ├── mulaw.go      # G.711 μ-law (North America, Japan)
    │   ├── alaw.go       # G.711 A-law (Europe)
    │   └── pcm.go        # PCM utilities
    └── resample/
        └── resample.go   # Sample rate conversion (8kHz ↔ 16kHz)
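
For reference, the core of what mulaw.go implements is standard G.711 arithmetic; a decode sketch follows (the encode direction mirrors it):

package codec

const muLawBias = 0x84 // 132, the G.711 encoding bias

// DecodeMuLawSample expands one 8-bit mu-law byte into a 16-bit linear PCM sample.
func DecodeMuLawSample(u byte) int16 {
	u = ^u                 // mu-law bytes are stored complemented
	sign := u & 0x80       // top bit carries the sign
	exp := (u >> 4) & 0x07 // 3-bit exponent (segment)
	mant := u & 0x0F       // 4-bit mantissa
	magnitude := ((int16(mant) << 3) + muLawBias) << exp
	magnitude -= muLawBias
	if sign != 0 {
		return -magnitude
	}
	return magnitude
}

// DecodeMuLaw expands a whole frame of mu-law bytes to PCM16.
func DecodeMuLaw(in []byte) []int16 {
	out := make([]int16, len(in))
	for i, b := range in {
		out[i] = DecodeMuLawSample(b)
	}
	return out
}
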
  2. Streaming STT Integration
     • Deepgram (recommended) or Whisper
     • Real-time transcription with interim results

  3. Streaming TTS Integration
     • ElevenLabs (recommended) for quality
     • Cartesia for low latency
     • Must support streaming output

  4. Voice Activity Detection (VAD)
     • Detect when the user starts/stops speaking
     • Enable AI interruption handling (see the sketch after this list)

  5. Turn Management
     • Track conversation state
     • Handle overlapping speech
     • Manage barge-in (user interrupts AI)
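
A minimal energy-based sketch of component 4 (VAD), using an RMS threshold plus hang time. Production systems often use a model-based detector (e.g. Silero); the threshold and hang-time values are illustrative assumptions, not tuned numbers:

package vad

import "math"

// VAD tracks speech activity over successive PCM16 frames.
type VAD struct {
	Threshold  float64 // RMS level treated as speech (illustrative, e.g. 500)
	HangFrames int     // silent frames tolerated before the turn is considered over
	silentRun  int
	speaking   bool
}

// Process consumes one PCM16 frame and reports whether the user is speaking
// and whether their turn just ended.
func (v *VAD) Process(frame []int16) (speaking bool, turnEnded bool) {
	if len(frame) == 0 {
		return v.speaking, false
	}
	var sum float64
	for _, s := range frame {
		sum += float64(s) * float64(s)
	}
	rms := math.Sqrt(sum / float64(len(frame)))

	if rms >= v.Threshold {
		v.speaking = true
		v.silentRun = 0
	} else if v.speaking {
		v.silentRun++
		if v.silentRun >= v.HangFrames {
			// Enough trailing silence: treat the turn as finished.
			v.speaking = false
			v.silentRun = 0
			return false, true
		}
	}
	return v.speaking, false
}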

Provider Recommendations

Component  Primary     Fallback               Rationale
STT        Deepgram    Whisper (self-hosted)  Lowest latency, excellent accuracy
TTS        ElevenLabs  Cartesia               Best voice quality
LLM        Claude      GPT-4                  Best reasoning, tool use

Latency Budget

Target: < 500ms end-to-end for natural conversation

Component  Target     Notes
STT        100-150ms  Deepgram streaming
LLM        200-300ms  Claude streaming
TTS        100-150ms  ElevenLabs streaming
Network    50ms       Twilio to server
Total      450-650ms  Acceptable for voice

The naive sum can exceed the 500ms target at the upper bound, but the stages overlap in practice: streaming TTS begins speaking on the first LLM tokens, so perceived time-to-first-audio stays near the target.

Implementation Phases

Phase 1: Audio Pipeline Foundation ✅

  • [x] Implement mu-law codec (omnivoice/audio/codec/mulaw.go)
  • [x] Implement A-law codec (omnivoice/audio/codec/alaw.go)
  • [x] Implement PCM utilities (omnivoice/audio/codec/pcm.go)
  • [x] Add codec tests with 100% pass rate
  • [ ] Connect Media Streams to transport layer
  • [ ] Pipe audio to/from transport

Phase 2: TTS Integration (ElevenLabs)

Priority: implement this phase first, since the go-elevenlabs SDK is ready.

  • [ ] Create TTS provider interface in omnivoice/tts/
  • [ ] Implement ElevenLabs streaming TTS provider
  • [ ] Support native ulaw_8000 output (no conversion needed)
  • [ ] Fallback: PCM output with mu-law conversion
  • [ ] Connect TTS to outbound transport

Key Feature: ElevenLabs WebSocket TTS supports native ulaw_8000 output format, eliminating the need for audio conversion on the outbound path.

LLM Response → ElevenLabs WebSocket TTS (ulaw_8000) → Twilio Media Streams
              No conversion needed!
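
A hedged sketch of that no-conversion path, assuming gorilla/websocket and our reading of the ElevenLabs stream-input WebSocket API (the go-elevenlabs SDK wraps this; verify endpoint and message shapes against current docs before relying on them):

package voice

import (
	"os"

	"github.com/gorilla/websocket"
)

// StreamTTS sends text to an ElevenLabs stream-input WebSocket opened with
// output_format=ulaw_8000 and forwards each audio chunk, untouched, to the
// Twilio Media Stream as a "media" event.
// Assumed endpoint: wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/
//   stream-input?output_format=ulaw_8000
func StreamTTS(ttsConn, twilioConn *websocket.Conn, streamSid, text string) error {
	open := map[string]any{"text": " ", "xi_api_key": os.Getenv("ELEVENLABS_API_KEY")}
	for _, m := range []map[string]any{open, {"text": text}, {"text": ""}} {
		if err := ttsConn.WriteJSON(m); err != nil { // empty text ends the input
			return err
		}
	}
	for {
		var chunk struct {
			Audio   string `json:"audio"` // already base64 mu-law at 8kHz
			IsFinal bool   `json:"isFinal"`
		}
		if err := ttsConn.ReadJSON(&chunk); err != nil {
			return err
		}
		if chunk.Audio != "" {
			// Forward as-is: the payload is already Twilio-ready.
			out := map[string]any{
				"event":     "media",
				"streamSid": streamSid,
				"media":     map[string]string{"payload": chunk.Audio},
			}
			if err := twilioConn.WriteJSON(out); err != nil {
				return err
			}
		}
		if chunk.IsFinal {
			return nil
		}
	}
}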

Phase 3: STT Integration (Deepgram)

  • [ ] Create STT provider interface in omnivoice/stt/
  • [ ] Implement Deepgram streaming STT provider
  • [ ] Connect inbound audio to STT (mu-law → PCM → Deepgram)
  • [ ] Handle interim and final transcripts
  • [ ] Add transcript event channel
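
A sketch of the Deepgram side, assuming gorilla/websocket. Note that Deepgram's live endpoint also accepts mu-law directly (encoding=mulaw&sample_rate=8000), which can skip the inbound conversion entirely; the response fields below match Deepgram's documented shape:

package voice

import (
	"net/http"

	"github.com/gorilla/websocket"
)

// ConnectDeepgram opens a live-transcription socket; raw mu-law frames are
// then written with conn.WriteMessage(websocket.BinaryMessage, ulaw).
func ConnectDeepgram(apiKey string) (*websocket.Conn, error) {
	u := "wss://api.deepgram.com/v1/listen?encoding=mulaw&sample_rate=8000&interim_results=true"
	h := http.Header{"Authorization": {"Token " + apiKey}}
	conn, _, err := websocket.DefaultDialer.Dial(u, h)
	return conn, err
}

// ReadTranscripts forwards final transcripts to a channel; interim results
// arrive on the same socket with is_final=false and are skipped here.
func ReadTranscripts(conn *websocket.Conn, out chan<- string) {
	defer close(out)
	for {
		var msg struct {
			IsFinal bool `json:"is_final"`
			Channel struct {
				Alternatives []struct {
					Transcript string `json:"transcript"`
				} `json:"alternatives"`
			} `json:"channel"`
		}
		if err := conn.ReadJSON(&msg); err != nil {
			return
		}
		if msg.IsFinal && len(msg.Channel.Alternatives) > 0 {
			if t := msg.Channel.Alternatives[0].Transcript; t != "" {
				out <- t
			}
		}
	}
}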

Phase 4: Conversation Management

  • [ ] Implement VAD for turn detection
  • [ ] Add barge-in (interruption) handling
  • [ ] Track conversation state
  • [ ] Connect STT → LLM → TTS pipeline
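
For the barge-in item, a small sketch: on detecting user speech mid-playback, cancel the in-flight TTS and flush Twilio's output buffer with the documented "clear" event (cancelTTS is a hypothetical hook):

package voice

import "github.com/gorilla/websocket"

// HandleBargeIn stops the AI mid-sentence: cancel in-flight TTS, then tell
// Twilio to drop any audio we sent that it has not yet played.
func HandleBargeIn(twilioConn *websocket.Conn, streamSid string, cancelTTS func()) error {
	cancelTTS()
	return twilioConn.WriteJSON(map[string]string{
		"event":     "clear",
		"streamSid": streamSid,
	})
}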

Phase 5: Production Hardening

  • [ ] Add error handling and reconnection
  • [ ] Implement graceful degradation
  • [ ] Add metrics and monitoring
  • [ ] Provider fallback (e.g., Cartesia if ElevenLabs fails)

References