Voice Integration Approaches - Product Requirements Document
This document describes the different approaches for adding voice capabilities to AI agents, with a focus on integration options for ADK-based agents (like stats-agent-team).
Overview
There are two primary approaches to voice-enabling AI agents:
| Approach | Description | Control Level | Best For |
|---|---|---|---|
| Managed Platform | Use a voice AI platform (ElevenLabs ConvAI, Vapi, Retell) | Medium | Quick deployment, simpler setup |
| Custom Pipeline | Build your own STT → LLM → TTS pipeline | Full | Complex workflows, multi-agent systems |
Approach A: ElevenLabs Conversational AI
ElevenLabs offers a fully-managed Conversational AI platform that handles voice interactions end-to-end.
Architecture
┌────────────────────────────────────────────────────────────────────────┐
│                      ElevenLabs Conversational AI                      │
│                                                                        │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐  │
│  │   User   │───►│   STT    │───►│   LLM    │───►│       TTS        │  │
│  │ (Voice)  │◄───│(Managed) │◄───│ (Custom) │◄───│   (ElevenLabs)   │  │
│  └──────────┘    └──────────┘    └────┬─────┘    └──────────────────┘  │
│                                       │                                │
│                                       ▼                                │
│                               ┌────────────────┐                       │
│                               │ Webhook Tools  │                       │
│                               │  MCP Servers   │                       │
│                               └────────────────┘                       │
└────────────────────────────────────────────────────────────────────────┘
Custom LLM Support
ElevenLabs ConvAI supports custom LLM endpoints, enabling integration with your own agents:
// Configuration for custom LLM (from go-elevenlabs)
type CustomLLM struct {
	URL           string            // Your Chat Completions-compatible endpoint
	ModelID       string            // Optional model identifier
	APIKey        string            // Authentication (via workspace secrets)
	CustomHeaders map[string]string // Additional request headers
	APIVersion    string            // API version string
	APIType       string            // "chat_completions" or "responses"
}
Supported LLM Types:
- Built-in: GPT-4o, Claude 3.5/4.5, Gemini 2.0, Grok, etc.
- Custom: Any OpenAI Chat Completions-compatible endpoint
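As a rough illustration of how such a configuration might be filled in and sanity-checked before it is sent to the platform, the sketch below redeclares the struct locally so that it compiles on its own. The `validate` method, the HTTPS requirement, the placeholder host, and the `stats-agent` model identifier are all assumptions made for this example; none of them come from go-elevenlabs.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// CustomLLM mirrors the fields listed above. It is redeclared here only so
// the sketch is self-contained.
type CustomLLM struct {
	URL           string
	ModelID       string
	APIKey        string
	CustomHeaders map[string]string
	APIVersion    string
	APIType       string
}

// validate is a hypothetical helper (not part of go-elevenlabs). It assumes
// the endpoint must be an absolute HTTPS URL and that APIType is one of the
// two values named in the struct comment.
func (c CustomLLM) validate() error {
	u, err := url.Parse(c.URL)
	if err != nil || u.Scheme != "https" || u.Host == "" {
		return errors.New("URL must be an absolute https URL")
	}
	switch c.APIType {
	case "chat_completions", "responses":
		return nil
	default:
		return fmt.Errorf("unsupported APIType %q", c.APIType)
	}
}

func main() {
	cfg := CustomLLM{
		// Placeholder host. Check whether the platform expects the base URL
		// or the full /chat/completions path.
		URL:     "https://your-server.com/v1",
		ModelID: "stats-agent", // hypothetical identifier
		APIType: "chat_completions",
	}
	if err := cfg.validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("config looks usable")
}
```

Checking the configuration locally is cheaper than discovering a malformed URL during a live call.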
Tool/Function Calling
ElevenLabs ConvAI supports multiple tool calling mechanisms:
1. Webhook Tools
   - Make HTTP calls to external endpoints during agent execution
   - Full control over request parameters and headers
   - Response timeout configuration (5-120 seconds)
   - Extract values from tool responses into dynamic variables
2. MCP (Model Context Protocol) Servers
   - Direct support for connecting MCP servers
   - SSE transport support
   - Per-tool approval policies
   - Tool configuration overrides
3. Built-in System Tools
   - Transfer to agent/number
   - Language detection
   - Voicemail detection
   - DTMF playback
   - Call end handling
Integration with ADK Agents
Can ElevenLabs ConvAI be used with ADK agents (like stats-agent-team)?
Yes, with two integration patterns:
Pattern 1: Custom LLM Endpoint
Expose your ADK agent as a Chat Completions-compatible endpoint:
// Your ADK agent exposes a /v1/chat/completions endpoint
// that wraps the agent's conversation handling
POST /v1/chat/completions
{
  "messages": [{"role": "user", "content": "Find statistics about climate change"}],
  "stream": true
}
Pros:
- ElevenLabs handles all voice complexity
- Low latency (optimized voice pipeline)
- Simple deployment
Cons:
- Limited control over conversation flow
- Must adapt ADK agent to Chat Completions API
- Tool calling flows through ElevenLabs
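To make Pattern 1 more concrete, here is a minimal sketch of a handler that accepts a Chat Completions-style request and streams the reply as server-sent events. The `AgentFunc` type is a hypothetical stand-in for whatever call drives the ADK agent. The sketch always streams regardless of the `stream` flag and emits only a minimal subset of the chunk fields, so a real client may require more (for example `id` and `model`).

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Messages []chatMessage `json:"messages"`
	Stream   bool          `json:"stream"` // decoded but ignored: this sketch always streams
}

// AgentFunc is a hypothetical adapter around the ADK agent. It returns the
// reply already split into chunks.
type AgentFunc func(ctx context.Context, userText string) ([]string, error)

// chatCompletionsHandler passes the latest message to the agent and streams
// the reply as "data:" frames, ending with the [DONE] sentinel.
func chatCompletionsHandler(agent AgentFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req chatRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil || len(req.Messages) == 0 {
			http.Error(w, "invalid request body", http.StatusBadRequest)
			return
		}
		last := req.Messages[len(req.Messages)-1].Content
		chunks, err := agent(r.Context(), last)
		if err != nil {
			http.Error(w, "agent error", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")
		flusher, _ := w.(http.Flusher)
		for _, c := range chunks {
			// Marshalling maps of strings and ints cannot fail here.
			payload, _ := json.Marshal(map[string]any{
				"object": "chat.completion.chunk",
				"choices": []map[string]any{
					{"index": 0, "delta": map[string]string{"content": c}},
				},
			})
			fmt.Fprintf(w, "data: %s\n\n", payload)
			if flusher != nil {
				flusher.Flush() // push each chunk out immediately
			}
		}
		fmt.Fprint(w, "data: [DONE]\n\n")
	}
}

// demoChatRequest exercises the handler in-process with an echo agent.
func demoChatRequest() (int, string) {
	agent := func(ctx context.Context, text string) ([]string, error) {
		return []string{"Echo: ", text}, nil
	}
	body := `{"messages":[{"role":"user","content":"hi"}],"stream":true}`
	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", strings.NewReader(body))
	rec := httptest.NewRecorder()
	chatCompletionsHandler(agent)(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	status, body := demoChatRequest()
	fmt.Println(status)
	fmt.Print(body)
}
```

Flushing after every chunk matters for voice: the TTS stage can start speaking as soon as the first tokens arrive instead of waiting for the full reply.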
Pattern 2: Webhook Tools
Keep your ADK agents as-is, call them via webhooks:
// ElevenLabs agent configuration
{
  "tools": [{
    "type": "webhook",
    "name": "research_statistics",
    "description": "Research statistics on a topic",
    "url": "https://your-server.com/agents/research",
    "method": "POST"
  }]
}
Pros:
- No changes to ADK agents
- Use ADK agents as specialized tools
- Keep agent expertise modular
Cons:
- Higher latency (LLM → Tool → Agent → LLM)
- Less conversational (tool call pattern)
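The receiving side of such a webhook could look roughly like the sketch below. The request body shape (`topic`), the response shape (`summary`), and the `ResearchFunc` adapter are assumptions made for illustration; the actual payload depends on how the tool's parameters are configured in the platform.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// researchRequest and researchResponse use hypothetical field names.
type researchRequest struct {
	Topic string `json:"topic"`
}

type researchResponse struct {
	Summary string `json:"summary"`
}

// ResearchFunc is a hypothetical adapter around an existing ADK agent.
type ResearchFunc func(ctx context.Context, topic string) (string, error)

// researchHandler validates the webhook call, delegates to the agent, and
// returns the result as JSON for the voice LLM to read back.
func researchHandler(research ResearchFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		var req researchRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Topic == "" {
			http.Error(w, "topic is required", http.StatusBadRequest)
			return
		}
		summary, err := research(r.Context(), req.Topic)
		if err != nil {
			http.Error(w, "research failed", http.StatusBadGateway)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(researchResponse{Summary: summary})
	}
}

// demoWebhook calls the handler in-process with a stub agent.
func demoWebhook(method, body string) (int, string) {
	research := func(ctx context.Context, topic string) (string, error) {
		return "stub result for " + topic, nil
	}
	req := httptest.NewRequest(method, "/agents/research", strings.NewReader(body))
	rec := httptest.NewRecorder()
	researchHandler(research)(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := demoWebhook(http.MethodPost, `{"topic":"climate change"}`)
	fmt.Println(code, body)
}
```

Because the agent call sits inside a webhook with a bounded response timeout, a long-running research step may need to return a partial or cached answer rather than block.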
Phone Integration
ElevenLabs ConvAI includes native phone integration:
// Register an incoming Twilio call.
// The response carries the TwiML to send back to Twilio.
registerResp, err := client.Twilio().RegisterCall(ctx, &TwilioRegisterCallRequest{
	AgentID:      "your-agent-id",
	FirstMessage: "Hello, how can I help you today?",
})
if err != nil {
	return err
}

// Make an outbound call from a registered agent phone number.
outboundResp, err := client.Twilio().OutboundCall(ctx, &TwilioOutboundCallRequest{
	AgentID:            "your-agent-id",
	AgentPhoneNumberID: "your-phone-id",
	ToNumber:           "+1234567890",
})
if err != nil {
	return err
}
Go Library Support
The go-elevenlabs library exposes these features through service accessors on the client:
import "github.com/agentplexus/go-elevenlabs"
client := elevenlabs.NewClient(apiKey)
// Access Conversational AI features
client.Twilio() // Phone integration
client.PhoneNumbers() // Phone number management
client.API() // Full ConvAI API access (agents, tools, knowledge bases)
Approach B: Custom Voice Pipeline (OmniVoice)
Build your own voice pipeline with full control over each component.
Architecture
┌────────────────────────────────────────────────────────────────────────┐
│                           OmniVoice Pipeline                           │
│                                                                        │
│  ┌──────────┐    ┌──────────────┐    ┌───────────────────────────┐     │
│  │  Twilio  │    │   Deepgram   │    │         ADK Agent         │     │
│  │  Media   │───►│     STT      │───►│                           │     │
│  │ Streams  │    │ (Streaming)  │    │  ┌─────────────────────┐  │     │
│  │          │    └──────────────┘    │  │   Claude + Tools    │  │     │
│  │          │                        │  │ (Research, Verify)  │  │     │
│  │          │    ┌──────────────┐    │  └─────────────────────┘  │     │
│  │          │◄───│  ElevenLabs  │◄───│                           │     │
│  │          │    │     TTS      │    │                           │     │
│  └──────────┘    │ (Streaming)  │    └───────────────────────────┘     │
│                  └──────────────┘                                      │
│                                                                        │
│  Audio Format: mu-law 8kHz ◄──► PCM 16kHz ◄──► Provider formats        │
└────────────────────────────────────────────────────────────────────────┘
Component Selection
| Component | Recommended | Alternative | Rationale |
|---|---|---|---|
| STT | Deepgram | Whisper (self-hosted) | Lowest latency, streaming |
| TTS | ElevenLabs | Cartesia | Best quality, streaming |
| LLM | Claude | GPT-4 | Best reasoning, tool use |
| Transport | Twilio Media Streams | LiveKit | Production telephony |
Audio Pipeline
The audio pipeline uses the omnivoice/audio/codec package:
Inbound:  Twilio (mu-law 8kHz) → MulawDecode → Resample → Deepgram (PCM 16kHz)
Outbound: ElevenLabs (ulaw_8000) → Twilio (mu-law 8kHz)
          OR
          ElevenLabs (pcm_16000) → Resample → MulawEncode → Twilio
Note: ElevenLabs supports native ulaw_8000 output, eliminating conversion overhead.
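For readers unfamiliar with the inbound conversion, the following sketch shows standard G.711 mu-law expansion to 16-bit PCM followed by a naive 8 kHz to 16 kHz upsample. The function names are hypothetical and are not the omnivoice/audio/codec API. Linear interpolation is used only to keep the example short; a production resampler would normally apply a proper low-pass filter.

```go
package main

import "fmt"

// decodeMulaw expands one G.711 mu-law byte to a 16-bit linear PCM sample.
func decodeMulaw(b byte) int16 {
	b = ^b // mu-law bytes are stored bit-inverted
	sign := b & 0x80
	exponent := (b >> 4) & 0x07
	mantissa := b & 0x0F
	// 0x84 is the standard mu-law bias (132).
	sample := ((int(mantissa) << 3) + 0x84) << exponent
	sample -= 0x84
	if sign != 0 {
		return int16(-sample)
	}
	return int16(sample)
}

// decodeMulawFrame decodes a whole frame of mu-law bytes.
func decodeMulawFrame(frame []byte) []int16 {
	out := make([]int16, len(frame))
	for i, b := range frame {
		out[i] = decodeMulaw(b)
	}
	return out
}

// upsample2x doubles the sample rate by inserting the midpoint between each
// pair of neighbouring samples. The final sample is repeated.
func upsample2x(in []int16) []int16 {
	out := make([]int16, 0, len(in)*2)
	for i, s := range in {
		next := s
		if i+1 < len(in) {
			next = in[i+1]
		}
		mid := int16((int32(s) + int32(next)) / 2)
		out = append(out, s, mid)
	}
	return out
}

func main() {
	frame := []byte{0xFF, 0x80, 0x00} // silence, max positive, max negative
	pcm8k := decodeMulawFrame(frame)
	pcm16k := upsample2x(pcm8k)
	fmt.Println(pcm8k)
	fmt.Println(pcm16k)
}
```

Each inbound frame pays this decode and resample cost, which is why requesting ulaw_8000 directly from the TTS provider is attractive on the outbound path.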
Integration with ADK Agents
A custom pipeline gives you full control over how audio reaches the agent and how its reply is spoken:
// In your voice handler (simplified: error handling and turn detection omitted)
func handleVoiceSession(ctx context.Context, conn *transport.Connection) {
	// 1. Receive audio from Twilio
	audioIn := conn.AudioIn()
	// 2. Transcribe with Deepgram
	transcript := sttClient.TranscribeStream(audioIn)
	// 3. Process with ADK agent
	response := adkAgent.ProcessMessage(ctx, transcript)
	// 4. Stream response through TTS
	audioOut := ttsClient.SynthesizeStream(response)
	// 5. Send to Twilio
	conn.AudioOut().Write(audioOut)
}
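One way to keep the handler above testable is to put each stage behind a small interface. The sketch below is an illustration under that assumption: the `Transcriber`, `Agent`, and `Synthesizer` interfaces, the `Pipeline` type, and the fake implementations are all hypothetical, and it processes one complete turn at a time rather than streaming.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Hypothetical stage interfaces. Real STT and TTS clients would stream.
type Transcriber interface {
	Transcribe(ctx context.Context, audio []byte) (string, error)
}

type Agent interface {
	ProcessMessage(ctx context.Context, text string) (string, error)
}

type Synthesizer interface {
	Synthesize(ctx context.Context, text string) ([]byte, error)
}

// Pipeline wires the three stages together.
type Pipeline struct {
	STT   Transcriber
	Agent Agent
	TTS   Synthesizer
}

// HandleTurn runs audio → transcript → agent reply → audio for one turn.
// It returns nil audio when the transcript is empty (for example, silence).
func (p Pipeline) HandleTurn(ctx context.Context, audioIn []byte) ([]byte, error) {
	transcript, err := p.STT.Transcribe(ctx, audioIn)
	if err != nil {
		return nil, fmt.Errorf("stt: %w", err)
	}
	if strings.TrimSpace(transcript) == "" {
		return nil, nil
	}
	reply, err := p.Agent.ProcessMessage(ctx, transcript)
	if err != nil {
		return nil, fmt.Errorf("agent: %w", err)
	}
	audioOut, err := p.TTS.Synthesize(ctx, reply)
	if err != nil {
		return nil, fmt.Errorf("tts: %w", err)
	}
	return audioOut, nil
}

// Fakes that treat audio as text, so the flow can run without any provider.
type fakeSTT struct{}

func (fakeSTT) Transcribe(_ context.Context, audio []byte) (string, error) {
	return string(audio), nil
}

type echoAgent struct{}

func (echoAgent) ProcessMessage(_ context.Context, text string) (string, error) {
	return "You said: " + text, nil
}

type fakeTTS struct{}

func (fakeTTS) Synthesize(_ context.Context, text string) ([]byte, error) {
	return []byte(text), nil
}

// runDemoTurn pushes one fake utterance through the pipeline.
func runDemoTurn(input string) (string, error) {
	p := Pipeline{STT: fakeSTT{}, Agent: echoAgent{}, TTS: fakeTTS{}}
	out, err := p.HandleTurn(context.Background(), []byte(input))
	return string(out), err
}

func main() {
	out, err := runDemoTurn("hello")
	fmt.Println(out, err)
}
```

With the stages behind interfaces, swapping Deepgram for a self-hosted Whisper, or ElevenLabs for Cartesia, touches only the adapter that implements the interface.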
Comparison Matrix
| Criteria | ElevenLabs ConvAI | Custom Pipeline (OmniVoice) |
|---|---|---|
| Setup Time | Hours | Days-Weeks |
| Voice Latency | ~300ms | ~400-600ms (depends on implementation) |
| Voice Quality | Excellent | Excellent (same TTS) |
| Agent Control | Medium (via custom LLM) | Full |
| Multi-Agent Support | Limited (one agent per call) | Full (orchestrate freely) |
| Tool Calling | Via webhooks/MCP | Native ADK tools |
| Conversation Flow | Platform-managed | Custom logic |
| Cost | Higher (platform fee) | Lower (pay per API call) |
| Phone Integration | Built-in | Via Twilio/etc |
Recommendations
Use ElevenLabs ConvAI When:
- You need voice capabilities quickly
- Your agent fits the single-agent model
- You want managed infrastructure
- Voice quality and latency are critical
- You don't need complex multi-agent orchestration
Use Custom Pipeline When:
- You have complex multi-agent workflows (like stats-agent-team)
- You need full control over conversation flow
- You want to use ADK's native tool system
- You have specific latency/cost requirements
- You're building a platform (not a single agent)
Hybrid Approach
For stats-agent-team and similar projects, consider a hybrid approach:
- Use ElevenLabs WebSocket TTS for voice synthesis (best quality)
- Use Deepgram for speech recognition (lowest latency)
- Use your ADK agents for the actual work
- OmniVoice orchestrates the pipeline
This gives you the best of both worlds: premium voice quality from ElevenLabs without giving up control of your agent architecture.
Implementation References
ElevenLabs ConvAI
- Library: github.com/agentplexus/go-elevenlabs
- Services: Twilio(), PhoneNumbers(), WebSocketTTS()
- Docs: ElevenLabs Conversational AI
OmniVoice Custom Pipeline
- Audio codecs: omnivoice/audio/codec (mu-law, A-law, PCM)
- Transport: omnivoice-twilio/transport (Media Streams WebSocket)
- TRD: omnivoice/docs/twilio-integration-trd.md
ADK Agents
- Library: google.golang.org/adk
- Example: github.com/agentplexus/stats-agent-team
- Pattern: BaseAgent → LLM Agent → Tools → A2A Protocol