OmniVoice Roadmap

This document tracks the implementation roadmap for OmniVoice.

Overview

OmniVoice is a voice abstraction layer for AgentPlexus that supports:

- TTS (Text-to-Speech)
- STT (Speech-to-Text)
- Voice Agents (real-time bidirectional conversation)

Current State

Completed (Interfaces Only)

Package	Status	Description
`tts/tts.go`	Interface	Provider interface, Client with fallback
`tts/errors.go`	Complete	Error definitions
`stt/stt.go`	Interface	Provider interface, streaming support
`stt/errors.go`	Complete	Error definitions
`agent/agent.go`	Interface	Session, events, tools, metrics
`transport/transport.go`	Interface	WebRTC, SIP, WebSocket protocols
`callsystem/callsystem.go`	Interface	Call/Meeting system integration

Examples (Sketches)

Example	Status	Description
`examples/simple-tts/`	Skeleton	Basic TTS with provider fallback
`examples/twilio-agent/`	Skeleton	Twilio ConversationRelay integration
`examples/zoom-agent/`	Skeleton	Zoom via Recall.ai

Implementation Roadmap

Phase 1: TTS Providers

Priority order based on AgentPlexus use cases (slide decks → LMS videos):

omnivoice/
└── tts/
    ├── elevenlabs/           # Priority 1 - Primary provider
    │   └── elevenlabs.go     # Uses github.com/agentplexus/go-elevenlabs
    ├── google/               # Priority 2 - Wide language support
    │   └── google.go
    ├── polly/                # Priority 3 - AWS integration
    │   └── polly.go
    ├── azure/                # Priority 4 - Enterprise
    │   └── azure.go
    └── cartesia/             # Priority 5 - Low latency
        └── cartesia.go

Dependencies:

- `github.com/agentplexus/go-elevenlabs` (move from grokify first)

Phase 2: STT Providers

omnivoice/
└── stt/
    ├── deepgram/             # Priority 1 - Best real-time latency
    │   └── deepgram.go
    ├── whisper/              # Priority 2 - OpenAI, good accuracy
    │   └── whisper.go
    ├── google/               # Priority 3 - Wide language support
    │   └── google.go
    ├── assemblyai/           # Priority 4 - Good features
    │   └── assemblyai.go
    └── azure/                # Priority 5 - Enterprise
        └── azure.go
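
Since `stt/stt.go` is described as supporting streaming, each provider above would need some way to consume audio chunks and emit interim and final transcripts. One possible shape, using channels, is sketched below. The `StreamingProvider` interface, the `Transcript` type, and the `echoProvider` fake are hypothetical and exist only to illustrate the pattern; they are not taken from the OmniVoice code.

```go
package main

import (
	"context"
	"fmt"
)

// Transcript is a hypothetical result type: interim results have Final == false.
type Transcript struct {
	Text  string
	Final bool
}

// StreamingProvider is an assumed interface shape for streaming STT.
type StreamingProvider interface {
	TranscribeStream(ctx context.Context, audio <-chan []byte) (<-chan Transcript, error)
}

// echoProvider is a fake that reports byte counts instead of real text.
type echoProvider struct{}

func (echoProvider) TranscribeStream(ctx context.Context, audio <-chan []byte) (<-chan Transcript, error) {
	out := make(chan Transcript)
	go func() {
		defer close(out)
		total := 0
		// send delivers one result unless the context is cancelled first.
		send := func(t Transcript) bool {
			select {
			case out <- t:
				return true
			case <-ctx.Done():
				return false
			}
		}
		for {
			select {
			case <-ctx.Done():
				return
			case chunk, ok := <-audio:
				if !ok {
					// Audio channel closed: emit the final transcript.
					send(Transcript{Text: fmt.Sprintf("%d bytes", total), Final: true})
					return
				}
				total += len(chunk)
				if !send(Transcript{Text: fmt.Sprintf("%d bytes so far", total)}) {
					return
				}
			}
		}
	}()
	return out, nil
}

// collect feeds chunks to a provider and gathers every transcript it emits.
func collect(p StreamingProvider, chunks [][]byte) ([]Transcript, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	audio := make(chan []byte)
	results, err := p.TranscribeStream(ctx, audio)
	if err != nil {
		return nil, err
	}
	go func() {
		defer close(audio)
		for _, c := range chunks {
			select {
			case audio <- c:
			case <-ctx.Done():
				return
			}
		}
	}()

	var got []Transcript
	for t := range results {
		got = append(got, t)
	}
	return got, nil
}

func main() {
	got, err := collect(echoProvider{}, [][]byte{{1, 2, 3}, {4, 5}})
	if err != nil {
		panic(err)
	}
	for _, t := range got {
		fmt.Printf("final=%v text=%q\n", t.Final, t.Text)
	}
}
```

Closing the audio channel signals end of input, and the provider closes the result channel when it is done, so callers can simply range over the results.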

Phase 3: Transport Layer

omnivoice/
└── transport/
    ├── websocket/            # Priority 1 - Most common for voice agents
    │   └── websocket.go
    ├── webrtc/               # Priority 2 - Browser/app integration
    │   └── webrtc.go
    └── http/                 # Priority 3 - Batch processing
        └── http.go

Phase 4: Call System Integrations

omnivoice/
└── callsystem/
    ├── twilio/               # Priority 1 - ConversationRelay
    │   └── twilio.go
    ├── recallai/             # Priority 2 - Zoom/Teams/Meet bots
    │   └── recallai.go
    ├── livekit/              # Priority 3 - Open source WebRTC
    │   └── livekit.go
    ├── daily/                # Priority 4 - Simple WebRTC
    │   └── daily.go
    └── ringcentral/          # Priority 5 - Enterprise PBX
        └── ringcentral.go

Phase 5: Voice Agent Providers

omnivoice/
└── agent/
    ├── custom/               # Priority 1 - STT + LLM + TTS orchestration
    │   └── custom.go         # Uses OmniLLM for LLM calls
    ├── elevenlabs/           # Priority 2 - ElevenLabs Agents platform
    │   └── elevenlabs.go
    ├── vapi/                 # Priority 3 - Vapi.ai
    │   └── vapi.go
    └── retell/               # Priority 4 - Retell AI
        └── retell.go

Call System Wiring Reference

How Voice Agents Connect to Calls

Voice agents need a transport layer to receive/send audio:

Use Case	Call System	Protocol	Notes
Phone calls (IVR, call center)	Twilio ConversationRelay	PSTN → WebSocket	Managed, recommended
Business phones	RingCentral	WebRTC/SIP	Native AI Receptionist available
Zoom meetings	Recall.ai	SDK → WebSocket	Avoid building Zoom SDK bot yourself
Teams meetings	Recall.ai	SDK → WebSocket	Same as Zoom
Google Meet	Recall.ai	SDK → WebSocket	Same as Zoom
Custom web app	LiveKit / Daily	WebRTC	Open source options
Mobile app	LiveKit	WebRTC	Cross-platform
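
If the table above were encoded as data, for example to pick a `callsystem` integration from configuration, it might look like this sketch. The use-case keys (`"phone"`, `"zoom"`, and so on), the `Route` type, and `RouteFor` are hypothetical names; only the call systems and protocols come from the table.

```go
package main

import "fmt"

// Route pairs a call system with its protocol path, as listed in the table.
type Route struct {
	CallSystem string
	Protocol   string
}

// routes maps hypothetical use-case keys to the table's rows.
var routes = map[string]Route{
	"phone":          {"Twilio ConversationRelay", "PSTN → WebSocket"},
	"business-phone": {"RingCentral", "WebRTC/SIP"},
	"zoom":           {"Recall.ai", "SDK → WebSocket"},
	"teams":          {"Recall.ai", "SDK → WebSocket"},
	"google-meet":    {"Recall.ai", "SDK → WebSocket"},
	"web-app":        {"LiveKit / Daily", "WebRTC"},
	"mobile-app":     {"LiveKit", "WebRTC"},
}

// RouteFor looks up the recommended call system for a use case.
func RouteFor(useCase string) (Route, error) {
	r, ok := routes[useCase]
	if !ok {
		return Route{}, fmt.Errorf("no call system for use case %q", useCase)
	}
	return r, nil
}

func main() {
	r, err := RouteFor("zoom")
	fmt.Println(r.CallSystem, r.Protocol, err)
}
```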

Wiring Diagram: Phone Call

┌──────────┐        ┌─────────────────┐        ┌───────────────────┐
│  Caller  │◄──────►│     Twilio      │◄──────►│   OmniVoice       │
│  (PSTN)  │  PSTN  │ConversationRelay│   WS   │   Voice Agent     │
└──────────┘        └─────────────────┘        │                   │
                                               │  STT → LLM → TTS  │
                                               └───────────────────┘

Wiring Diagram: Zoom Meeting

┌─────────────────────────────────────────────────────────────────┐
│                        Zoom Meeting                              │
│   ┌─────────┐  ┌─────────┐  ┌─────────────────────────────────┐ │
│   │  User 1 │  │  User 2 │  │     Recall.ai Bot               │ │
│   └─────────┘  └─────────┘  └────────────────┬────────────────┘ │
└──────────────────────────────────────────────┼──────────────────┘
                                               │ WebSocket
                                               ▼
                              ┌─────────────────────────────────┐
                              │        OmniVoice Agent          │
                              │  STT → LLM → TTS                │
                              └─────────────────────────────────┘

Dependencies

External SDKs Needed

Provider	Go SDK	Status
ElevenLabs	`github.com/agentplexus/go-elevenlabs`	Move from grokify
Deepgram	`github.com/deepgram/deepgram-go-sdk`	Available
OpenAI (Whisper)	`github.com/sashabaranov/go-openai`	Available
Google Cloud	`cloud.google.com/go/speech`	Available
AWS (Polly)	`github.com/aws/aws-sdk-go-v2`	Available
Azure	`github.com/Azure/azure-sdk-for-go`	Available
Twilio	`github.com/twilio/twilio-go`	Available
LiveKit	`github.com/livekit/server-sdk-go`	Available
Recall.ai	None (REST API)	Build client

Internal Dependencies

Package	Depends On
`tts/elevenlabs`	`go-elevenlabs`
`agent/custom`	`OmniLLM`, `tts/`, `stt/`
`callsystem/twilio`	`transport/websocket`
`callsystem/recallai`	`transport/websocket`

Agent Framework Integration

OmniVoice can be used with Go-based agent frameworks:

Framework	Integration Point
Eino	Use OmniVoice for voice I/O in Eino agents
Google ADK	Bidirectional audio streaming support
Custom	Direct STT/TTS/Agent interfaces

Latency Targets

For natural conversation:

Metric	Target	Acceptable	Poor
Total round-trip	< 500ms	< 1000ms	> 1500ms
STT latency	< 200ms	< 300ms	> 500ms
LLM latency	< 300ms	< 500ms	> 1000ms
TTS latency	< 150ms	< 250ms	> 400ms

Next Steps

1. Move go-elevenlabs to github.com/agentplexus/go-elevenlabs
2. Implement tts/elevenlabs using the moved SDK
3. Implement stt/deepgram for real-time transcription
4. Implement transport/websocket for streaming
5. Implement callsystem/twilio for phone call integration
6. Implement agent/custom to orchestrate STT + OmniLLM + TTS