OmniVoice¶
Voice abstraction layer for AgentPlexus supporting TTS, STT, and Voice Agents across multiple providers and transport protocols.
Architecture Overview¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ OmniVoice │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────────────┐ │
│ │ TTS │ │ STT │ │ Voice Agent │ │
│ │ │ │ │ │ │ │
│ │ Text → Audio│ │ Audio → Text│ │ Real-time bidirectional voice │ │
│ └──────┬──────┘ └──────┬──────┘ └───────────────┬─────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Provider Layer │ │
│ ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤ │
│ │ ElevenLabs │ Deepgram │ Google Cloud│ AWS │ Azure │ │
│ │ Cartesia │ Whisper │ AssemblyAI │ Polly │ Speech │ │
│ └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Transport Layer │ │
│ ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤ │
│ │ WebRTC │ SIP │ PSTN │ WebSocket │ HTTP │ │
│ └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Call System Integration │ │
│ ├─────────────┬─────────────┬─────────────┬─────────────┬─────────────┤ │
│ │ Twilio │ RingCentral │ Zoom │ LiveKit │ Daily │ │
│ └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key Features¶
- Provider Agnostic - Swap TTS/STT providers without code changes
- Modular Architecture - Use only the layers you need
- Production Ready - Designed for real-time, low-latency voice applications
- Full Stack - From phone calls to audio processing
Package Structure¶
omnivoice/
├── tts/ # Text-to-Speech
│ ├── tts.go # Interface definitions
│ └── providertest/ # Conformance test suite
│
├── stt/ # Speech-to-Text
│ ├── stt.go # Interface definitions
│ └── providertest/ # Conformance test suite
│
├── agent/ # Voice Agent orchestration
│ ├── agent.go # Interface definitions
│ └── session.go # Conversation session management
│
├── transport/ # Audio transport protocols
│ └── transport.go # Interface definitions
│
├── callsystem/ # Call system integrations
│ └── callsystem.go # Interface definitions
│
├── audio/ # Audio codec utilities
│ └── codec/ # PCM, mu-law, a-law
│
├── subtitle/ # Subtitle generation
│ └── subtitle.go # SRT/VTT from transcription results
│
├── mcp/ # MCP server for voice interactions
│ └── server.go
│
└── pipeline/ # Pipeline components
└── pipeline.go # STT/TTS/Transport pipelines
Quick Start¶
import (
"context"
"github.com/agentplexus/omnivoice/tts"
)
// Create a provider (e.g., ElevenLabs)
provider, err := elevenlabs.New(elevenlabs.WithAPIKey(apiKey))
if err != nil {
log.Fatal(err)
}
// Synthesize speech
result, err := provider.Synthesize(ctx, "Hello, world!", tts.SynthesisConfig{
VoiceID: "voice-id",
OutputFormat: "mp3",
})
Use Case Recommendations¶
| Use Case | Call System | Transport | Notes |
|---|---|---|---|
| IVR / Call Center | Twilio ConversationRelay | PSTN/SIP | Best managed solution |
| Business Phone | RingCentral | WebRTC/SIP | Native AI Receptionist available |
| Custom Web App | LiveKit or Daily | WebRTC | Open source, flexible |
| Zoom Meetings | Recall.ai + Zoom | SDK → WebSocket | Avoid building Zoom bot yourself |
| Browser Widget | Direct WebSocket | WebSocket | ElevenLabs widget or custom |
| Mobile App | LiveKit | WebRTC | Cross-platform support |
Latency Targets¶
For natural conversation, total round-trip latency should be under 500ms:
| Metric | Target | Acceptable | Poor |
|---|---|---|---|
| Total round-trip | < 500ms | < 1000ms | > 1500ms |
| STT latency | < 200ms | < 300ms | > 500ms |
| LLM latency | < 300ms | < 500ms | > 1000ms |
| TTS latency | < 150ms | < 250ms | > 400ms |