Stream speech

The same synthesis as Create speech, returned as chunked bytes so playback can start after the first sentence. This is the ZeliSpeech text_to_speech.stream endpoint.

POST /v1/text-to-speech/{voice_id}/stream

The path, query, and body are identical to Create speech; only the transfer changes. A streamed mp3 or opus response is a single valid container (one continuous encoder per request), so you can pipe it straight to a file or a player.

Streamed timestamps

POST /v1/text-to-speech/{voice_id}/stream/with-timestamps

Streams the audio plus an approximate character alignment as NDJSON (one JSON object per line). The alignment is structurally exact, but its timings are approximate because they are evenly spaced; see Streaming › character timestamps.
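
Because NDJSON lines can be split across network chunks, a client has to buffer bytes until it sees a newline before parsing. The sketch below shows one way to do that. It assumes only that each line is a complete JSON object; the fields inside each object are not specified on this page, so the code does not touch them. `iter_ndjson` is a hypothetical helper, not part of zeli_tts.

```python
import json

def iter_ndjson(chunks):
    """Yield one parsed object per complete line from an iterable of byte chunks."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        # Everything before the last newline is complete; keep the tail for later.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            if line.strip():
                yield json.loads(line)
    # A final line without a trailing newline is still a complete object.
    if pending.strip():
        yield json.loads(pending)
```

Any iterable of bytes works as input, for example the streamed response body from an HTTP client.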

Examples

from zeli_tts import ZeliSpeech
 
client = ZeliSpeech(api_key="sk-zeli-...", base_url="https://voice.your-domain.com")
 
audio = client.text_to_speech.stream(
    voice_id="zeli-voice-1",
    text="First bytes arrive fast; the rest streams behind them.",
    output_format="mp3_44100_128",
)
with open("stream.mp3", "wb") as f:
    for chunk in audio:      # chunks arrive as they are generated
        f.write(chunk)

curl -N -X POST \
  "https://voice.your-domain.com/v1/text-to-speech/zeli-voice-1/stream?output_format=mp3_44100_128" \
  -H "Authorization: Bearer sk-zeli-..." \
  -H "Content-Type: application/json" \
  -d '{"text":"Low time-to-first-byte."}' \
  --output stream.mp3

from zeli_tts import ZeliSpeech, stream
 
client = ZeliSpeech(api_key="sk-zeli-...", base_url="https://voice.your-domain.com")
audio = client.text_to_speech.stream(
    voice_id="zeli-voice-1",
    text="Playing as I generate.",
    output_format="mp3_44100_128",
)
stream(audio)    # live playback via ffplay / mpv

Backpressure is handled

Generation is decoupled from client-paced delivery: a producer holds the batch-1 engine lock and fills a bounded in-memory buffer, which the client drains without taking the lock. A slow or stalled reader therefore can't starve other requests. ZELI_MAX_STREAM_CHARS caps the buffered text, and the source is always closed on disconnect so the lock is released.
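
As an illustration of the pattern only, and not the server's actual implementation, the sketch below shows a producer that holds a lock while generating into a buffer and a reader that drains it without the lock. It assumes the buffer stays bounded because the input text is capped. `ENGINE_LOCK`, `MAX_STREAM_CHARS`, `fake_synthesize`, `start_stream`, and `drain` are all hypothetical names.

```python
import queue
import threading

ENGINE_LOCK = threading.Lock()   # stands in for the single-request engine lock
MAX_STREAM_CHARS = 200           # hypothetical stand-in for ZELI_MAX_STREAM_CHARS
_DONE = object()                 # sentinel marking the end of the stream

def fake_synthesize(text):
    """Placeholder engine: yields one fake audio chunk per word."""
    for word in text.split():
        yield word.encode("utf-8")

def start_stream(text):
    """Start generation in a background thread; return the buffer and the thread."""
    if len(text) > MAX_STREAM_CHARS:
        raise ValueError("text exceeds MAX_STREAM_CHARS")
    buf = queue.Queue()  # bounded indirectly: capped input means capped output

    def produce():
        with ENGINE_LOCK:            # held only while generating
            for chunk in fake_synthesize(text):
                buf.put(chunk)       # never waits for the reader
        buf.put(_DONE)               # the lock is already released here

    thread = threading.Thread(target=produce, daemon=True)
    thread.start()
    return buf, thread

def drain(buf):
    """Reader side: consume chunks at any pace without touching the lock."""
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item
```

In this sketch the lock is free as soon as generation finishes, however slowly the reader calls `drain`.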

When to prefer streaming

Conversational agents and long-form reads where latency matters.
Piping straight to a phone call — pair with output_format=ulaw_8000.
Anywhere you'd rather start playback than wait for a whole file.
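
For the phone-call case, telephony stacks commonly expect fixed-size frames, while streamed chunks arrive in arbitrary sizes. The sketch below regroups chunks into 20 ms frames. It assumes that ulaw_8000 yields raw, headerless 8-bit mono mu-law at 8 kHz (8000 bytes per second, so 160 bytes per 20 ms); that assumption is not confirmed on this page. `reframe` is a hypothetical helper, not part of zeli_tts.

```python
FRAME_BYTES = 160  # 20 ms at 8000 bytes/s (assumes raw 8-bit mono mu-law)

def reframe(chunks, frame_bytes=FRAME_BYTES):
    """Regroup arbitrary-sized byte chunks into fixed-size frames."""
    pending = bytearray()
    for chunk in chunks:
        pending.extend(chunk)
        while len(pending) >= frame_bytes:
            yield bytes(pending[:frame_bytes])
            del pending[:frame_bytes]
    if pending:
        # Pad the final short frame with mu-law silence (0xFF encodes zero).
        yield bytes(pending) + b"\xff" * (frame_bytes - len(pending))
```

Pass the iterator returned by the streaming call as `chunks` and forward each frame to the call leg at the pace the telephony side requires.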

For text that arrives incrementally (LLM tokens), use the realtime WebSocket instead.
