Zeligate Voice · Developer docs · v1.3.0

Stream speech

The same synthesis as Create speech, returned as chunked bytes so playback can start after the first sentence. This is the ZeliSpeech text_to_speech.stream endpoint.

POST /v1/text-to-speech/{voice_id}/stream

The path, query, and body are identical to Create speech — only the transfer changes. A streamed mp3 or opus response is a single valid container (one continuous encoder per request), so you can pipe it straight to a file or a player.
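
Because the response is one continuous container, the same loop can feed any writable binary sink. The sketch below is a minimal illustration, not part of the SDK: `pipe_chunks` is a hypothetical helper, and the demo bytes are placeholders standing in for real audio chunks.

```python
import io
from typing import BinaryIO, Iterable


def pipe_chunks(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write each chunk to the sink as it arrives; return total bytes written."""
    total = 0
    for chunk in chunks:
        sink.write(chunk)
        total += len(chunk)
    sink.flush()
    return total


# Placeholder bytes stand in for a streamed response (not real mp3 data).
demo_sink = io.BytesIO()
written = pipe_chunks([b"ID3", b"\x00\x01", b"\x02"], demo_sink)
```

To play rather than save, the sink could be the stdin of a player process, for example `subprocess.Popen(["ffplay", "-nodisp", "-autoexit", "-i", "pipe:0"], stdin=subprocess.PIPE).stdin`, assuming ffplay is installed on the client machine.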

Streamed timestamps

POST /v1/text-to-speech/{voice_id}/stream/with-timestamps

Streams the audio plus an approximate character alignment as NDJSON (one JSON object per line). The alignment is structurally exact, but its timings are evenly spaced, so treat them as approximate; see Streaming › character timestamps.
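
Reading NDJSON from a chunked response needs a small line buffer, because a chunk boundary can fall in the middle of a JSON object. The following is a minimal sketch of that buffering. The helper name `iter_ndjson` and the field names in the demo payload (`seq`, `chars`) are assumptions made for illustration; the actual record schema is not described on this page.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one parsed JSON object per line of a chunked NDJSON stream."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Only complete lines are parsed; a partial line waits for more bytes.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        # Final record sent without a trailing newline.
        yield json.loads(buffer)


# Hypothetical payload: these field names are invented for the demo.
demo = [b'{"seq": 0, "ch', b'ars": ["H", "i"]}\n{"seq"', b': 1, "chars": ["!"]}\n']
records = list(iter_ndjson(demo))
```

Parsing line by line keeps memory use flat and lets the caller act on each record as soon as its line is complete.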

Examples

from zeli_tts import ZeliSpeech

# Point the client at your own deployment.
client = ZeliSpeech(api_key="sk-zeli-...", base_url="https://voice.your-domain.com")

# stream() returns an iterator of audio byte chunks.
audio = client.text_to_speech.stream(
    voice_id="zeli-voice-1",
    text="First bytes arrive fast; the rest streams behind them.",
    output_format="mp3_44100_128",
)
with open("stream.mp3", "wb") as f:
    for chunk in audio:      # write each chunk as soon as it arrives
        f.write(chunk)

Backpressure is handled

Generation is decoupled from client-paced delivery — a producer holds the batch-1 engine lock and fills a bounded in-memory buffer that the client drains lock-free. A slow or stalled reader can't starve other requests. ZELI_MAX_STREAM_CHARS caps the buffered text; the source is always closed on disconnect so the lock is released.
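
The decoupling described above can be modelled in a few lines. This is an illustrative sketch only, not the server's implementation: `engine_lock`, `fake_synthesize`, `start_stream`, `drain`, and the `MAX_STREAM_CHARS` value are all hypothetical stand-ins. It assumes the buffer is bounded indirectly by capping the input text, so the producer never has to wait on the reader while it holds the lock.

```python
import queue
import threading

MAX_STREAM_CHARS = 64          # hypothetical stand-in for ZELI_MAX_STREAM_CHARS
engine_lock = threading.Lock() # stand-in for the single-batch engine lock
_DONE = object()               # sentinel marking the end of a stream


def fake_synthesize(text):
    """Stand-in engine: emits one 'audio' chunk per word."""
    for word in text.split():
        yield word.encode()


def start_stream(text):
    """Start a producer that fills a buffer while holding the engine lock."""
    if len(text) > MAX_STREAM_CHARS:
        raise ValueError("text exceeds the stream cap")
    buf = queue.Queue()  # size is bounded by the text cap checked above

    def produce():
        with engine_lock:
            try:
                for chunk in fake_synthesize(text):
                    buf.put(chunk)
            finally:
                buf.put(_DONE)  # always terminate the stream

    worker = threading.Thread(target=produce, daemon=True)
    worker.start()
    return buf, worker


def drain(buf):
    """Client side: read chunks at any pace without touching the lock."""
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


buf, worker = start_stream("hello slow reader")
worker.join()                        # generation has finished
lock_free_before_read = not engine_lock.locked()
chunks = list(drain(buf))            # the reader drains afterwards
```

In this model the lock is already free before the reader consumes a single chunk, which is the property that keeps a stalled client from blocking other requests.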

When to prefer streaming

  • Conversational agents and long-form reads where latency matters.
  • Piping straight to a phone call — pair with output_format=ulaw_8000.
  • Anywhere you'd rather start playback than wait for a whole file.
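
For the phone-call case, telephony stacks commonly expect fixed 20 ms frames, while streamed chunks arrive in arbitrary sizes. The sketch below re-chunks a byte stream into 160-byte frames. It assumes `ulaw_8000` yields raw, headerless 8-bit mu-law at 8000 samples per second (one byte per sample), which this page does not confirm; `frames_20ms` is a hypothetical helper, not an SDK function.

```python
from typing import Iterable, Iterator

FRAME_BYTES = 160  # 8000 samples/s * 0.020 s * 1 byte per mu-law sample


def frames_20ms(chunks: Iterable[bytes], pad: bytes = b"\xff") -> Iterator[bytes]:
    """Regroup arbitrary-sized chunks into fixed 160-byte (20 ms) frames."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= FRAME_BYTES:
            yield bytes(buffer[:FRAME_BYTES])
            del buffer[:FRAME_BYTES]
    if buffer:
        # Pad the final short frame with mu-law silence (0xFF).
        yield bytes(buffer) + pad * (FRAME_BYTES - len(buffer))


# 350 placeholder bytes in uneven chunks become three 160-byte frames.
demo_frames = list(frames_20ms([b"\x00" * 100, b"\x00" * 250]))
```

Each frame can then be handed to whatever transport carries the call audio; the framing interval is a property of that transport, so adjust `FRAME_BYTES` if it expects something other than 20 ms.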

For text that arrives incrementally (LLM tokens), use the realtime WebSocket instead.

Zeligate Voice API · self-hosted · secure data sovereignty · source