Skip to content

WebSocket TTS

Real-time text-to-speech streaming via WebSocket for low-latency voice synthesis.

Overview

The WebSocket TTS service enables streaming text to speech in real-time, making it ideal for:

  • LLM Integration: Stream text from language models as it's generated
  • Interactive Applications: Voice assistants, chatbots, real-time narration
  • Low Latency: Get audio output before the full text is available

Basic Usage

// Connect to WebSocket TTS
conn, err := client.WebSocketTTS().Connect(ctx, voiceID, nil)
if err != nil {
    log.Fatal(err)
}
defer conn.Close()

// Send text
conn.SendText("Hello, ")
conn.SendText("this is streaming ")
conn.SendText("text to speech!")

// Flush to finalize
conn.Flush()

// Receive audio chunks
for audio := range conn.Audio() {
    // Play or save audio chunks
    player.Write(audio)
}

With Options

opts := &elevenlabs.WebSocketTTSOptions{
    // Use turbo model for lowest latency
    ModelID: "eleven_turbo_v2_5",

    // PCM format for real-time playback
    OutputFormat: "pcm_16000",

    // Latency optimization (0-4, higher = faster but lower quality)
    OptimizeStreamingLatency: 3,

    // Enable SSML parsing
    EnableSSMLParsing: true,

    // Voice settings
    VoiceSettings: &elevenlabs.VoiceSettings{
        Stability:       0.5,
        SimilarityBoost: 0.75,
    },
}

conn, err := client.WebSocketTTS().Connect(ctx, voiceID, opts)

Streaming from LLM

// Connect to TTS
conn, err := client.WebSocketTTS().Connect(ctx, voiceID, &elevenlabs.WebSocketTTSOptions{
    ModelID:                  "eleven_turbo_v2_5",
    OutputFormat:             "pcm_16000",
    OptimizeStreamingLatency: 3,
})
if err != nil {
    log.Fatal(err)
}
defer conn.Close()

// Stream LLM output to TTS
go func() {
    for chunk := range llmOutputStream {
        if err := conn.SendText(chunk); err != nil {
            log.Printf("send error: %v", err)
            return
        }
    }
    conn.Flush()
}()

// Play audio as it arrives
for audio := range conn.Audio() {
    audioPlayer.Write(audio)
}

Using StreamText Helper

// Create a channel of text chunks
textStream := make(chan string)

// Start streaming (this handles flushing automatically)
audioOut, errOut := conn.StreamText(ctx, textStream)

// Send text chunks
go func() {
    defer close(textStream)
    textStream <- "Hello, "
    textStream <- "world!"
}()

// Receive audio
for audio := range audioOut {
    // Process audio
}

// Check for errors
if err := <-errOut; err != nil {
    log.Printf("streaming error: %v", err)
}

Word Alignments

// Receive word-level timing
go func() {
    for align := range conn.Alignments() {
        for i, char := range align.Characters {
            fmt.Printf("%s: %.3fs - %.3fs\n",
                char,
                align.CharacterStart[i],
                align.CharacterEnd[i])
        }
    }
}()

Error Handling

// Monitor errors
go func() {
    for err := range conn.Errors() {
        log.Printf("WebSocket error: %v", err)
    }
}()

Stream Completion Behavior

ElevenLabs WebSocket TTS does not send an explicit "end of stream" signal. After calling Flush(), the server generates any remaining audio and then waits for more input. If no input arrives within the inactivity timeout (default 20 seconds), the server sends an input_timeout_exceeded error and closes the connection.

This behavior has implications for detecting when audio generation is complete:

Default Behavior

With the default 20-second timeout, your application will wait up to 20 seconds after the last audio chunk before the connection closes:

conn.Flush()

// This loop will block for up to 20 seconds after last audio
for audio := range conn.Audio() {
    player.Write(audio)
}

Faster Completion Detection

For applications that need faster stream completion, set a shorter InactivityTimeout and treat the timeout as successful completion:

opts := &elevenlabs.WebSocketTTSOptions{
    ModelID:           "eleven_turbo_v2_5",
    OutputFormat:      "pcm_16000",
    InactivityTimeout: 5, // 5 seconds instead of 20
}

conn, err := client.WebSocketTTS().Connect(ctx, voiceID, opts)
if err != nil {
    log.Fatal(err)
}
defer conn.Close()

// Send text and flush
conn.SendText("Hello, world!")
conn.Flush()

// Use Done() channel to detect completion
var receivedAudio bool
for {
    select {
    case audio, ok := <-conn.Audio():
        if !ok {
            return // Channel closed
        }
        receivedAudio = true
        player.Write(audio)
    case <-conn.Done():
        // All audio received after flush
        return
    case err := <-conn.Errors():
        // Treat timeout as success if we received audio
        if receivedAudio && strings.Contains(err.Error(), "input_timeout_exceeded") {
            return // Stream completed successfully
        }
        log.Printf("error: %v", err)
        return
    }
}

OmniVoice Provider

The OmniVoice TTS provider handles this automatically by setting a 5-second inactivity timeout and treating the timeout as successful completion when audio was received after flush.

Options Reference

Option Type Default Description
ModelID string eleven_turbo_v2_5 TTS model to use
OutputFormat string pcm_16000 Audio format
VoiceSettings *VoiceSettings nil Voice parameters
OptimizeStreamingLatency int 3 Latency vs quality (0-4)
EnableSSMLParsing bool false Parse SSML in text
LanguageCode string "" ISO language code
ChunkLengthSchedule []int nil Custom chunking
InactivityTimeout int 20 Timeout in seconds

Output Formats

For real-time playback, PCM formats are recommended:

  • pcm_16000 - 16kHz PCM (lowest latency)
  • pcm_22050 - 22.05kHz PCM
  • pcm_24000 - 24kHz PCM
  • pcm_44100 - 44.1kHz PCM (highest quality)

MP3 formats are also available but add encoding latency:

  • mp3_44100_64 - 64kbps MP3
  • mp3_44100_128 - 128kbps MP3
  • mp3_44100_192 - 192kbps MP3