Stream speech
Runs the same synthesis as Create speech but returns the audio as chunked bytes, so playback can start after the first sentence. This is the ZeliSpeech text_to_speech.stream endpoint.
The path, query, and body are identical to Create speech; only the transfer changes. A streamed mp3 or opus response is a single valid container (one continuous encoder per request), so you can pipe it straight to a file or a player.
Streamed timestamps
Streams the audio plus an approximate character alignment as NDJSON (one JSON object per line). Alignment is structurally exact but evenly spaced — see Streaming › character timestamps.
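Because network chunks rarely end on line boundaries, an NDJSON consumer has to buffer partial lines before parsing. The sketch below shows one way to do that over any iterable of byte chunks. The helper name iter_ndjson is hypothetical and is not part of the zeli_tts SDK, and no field names are assumed for the objects themselves.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one parsed object per complete line, buffering partial lines.

    Hypothetical helper for illustration; not part of the zeli_tts SDK.
    """
    pending = b""
    for chunk in chunks:
        pending += chunk
        # Everything before the last newline is complete; the tail may be partial.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            if line.strip():  # skip blank lines
                yield json.loads(line)
    if pending.strip():  # final line without a trailing newline
        yield json.loads(pending)
```

Each yielded object can then be handled as soon as its line is complete, without waiting for the rest of the response.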
Examples
# Python: write the stream to a file
from zeli_tts import ZeliSpeech

client = ZeliSpeech(api_key="sk-zeli-...", base_url="https://voice.your-domain.com")
audio = client.text_to_speech.stream(
    voice_id="zeli-voice-1",
    text="First bytes arrive fast; the rest streams behind them.",
    output_format="mp3_44100_128",
)
with open("stream.mp3", "wb") as f:
    for chunk in audio:  # chunks arrive as they are generated
        f.write(chunk)

# curl: -N disables output buffering
curl -N -X POST \
  "https://voice.your-domain.com/v1/text-to-speech/zeli-voice-1/stream?output_format=mp3_44100_128" \
  -H "Authorization: Bearer sk-zeli-..." \
  -H "Content-Type: application/json" \
  -d '{"text":"Low time-to-first-byte."}' \
  --output stream.mp3

# Python: live playback
from zeli_tts import ZeliSpeech, stream

client = ZeliSpeech(api_key="sk-zeli-...", base_url="https://voice.your-domain.com")
audio = client.text_to_speech.stream(
    voice_id="zeli-voice-1",
    text="Playing as I generate.",
    output_format="mp3_44100_128",
)
stream(audio)  # live playback via ffplay / mpv

Generation is decoupled from client-paced delivery: a producer holds the
batch-1 engine lock and fills a bounded in-memory buffer that the client drains
lock-free. A slow or stalled reader can't starve other requests.
ZELI_MAX_STREAM_CHARS caps the buffered text; the source is always
closed on disconnect so the lock is released.
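To make the decoupling concrete, here is a minimal sketch of the pattern, not the server's actual code. A producer thread holds a lock only while generating and writes into a bounded queue, and the reader drains that queue at its own pace. Everything named here is an assumption for illustration: ENGINE_LOCK, MAX_STREAM_CHARS (standing in for ZELI_MAX_STREAM_CHARS, with an arbitrary value), fake_synthesize, start_stream, and drain.

```python
import queue
import threading

ENGINE_LOCK = threading.Lock()  # stands in for the batch-1 engine lock
MAX_STREAM_CHARS = 64           # stands in for ZELI_MAX_STREAM_CHARS; value is illustrative
_DONE = object()                # sentinel marking end of stream


def fake_synthesize(text):
    """Hypothetical engine: yields one 'audio' chunk per word."""
    for word in text.split():
        yield word.encode("utf-8")


def start_stream(text):
    """Start generation on a producer thread; return (buffer, cancel_event, thread)."""
    if len(text) > MAX_STREAM_CHARS:
        raise ValueError("text exceeds the stream cap")
    # At most one chunk per character plus the sentinel, so puts never block
    # and the producer never waits on the reader while holding the lock.
    buf = queue.Queue(maxsize=MAX_STREAM_CHARS + 1)
    cancelled = threading.Event()  # set this when the client disconnects

    def produce():
        try:
            with ENGINE_LOCK:  # held only while generating
                for chunk in fake_synthesize(text):
                    if cancelled.is_set():
                        break
                    buf.put_nowait(chunk)
        finally:
            buf.put_nowait(_DONE)  # always signal end of stream

    worker = threading.Thread(target=produce, daemon=True)
    worker.start()
    return buf, cancelled, worker


def drain(buf):
    """Client side: yield chunks until the sentinel, never touching the lock."""
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item
```

In this sketch the cap on input text is what bounds the buffer. Because the queue can always hold the full output, the lock is released as soon as generation ends, even if the reader has not consumed a single chunk.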
When to prefer streaming
- Conversational agents and long-form reads where latency matters.
- Piping straight to a phone call: pair with output_format=ulaw_8000.
- Anywhere you'd rather start playback than wait for a whole file.
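For the phone-call case, telephony stacks usually expect fixed-size audio frames rather than arbitrary network chunks. The sketch below re-frames a chunk iterator into 20 ms frames, assuming that ulaw_8000 yields raw, headerless G.711 mu-law at 8 kHz (1 byte per sample, so 160 bytes per 20 ms). That assumption, the 20 ms interval, and the helper name iter_frames are illustrative and not confirmed by this reference.

```python
from typing import Iterable, Iterator

SAMPLE_RATE_HZ = 8000   # ulaw_8000: 8 kHz sampling
BYTES_PER_SAMPLE = 1    # G.711 mu-law uses 8 bits per sample
FRAME_MS = 20           # a common telephony packet interval (assumption)
FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 160

ULAW_SILENCE = b"\xff"  # mu-law code for a zero-amplitude sample


def iter_frames(
    chunks: Iterable[bytes],
    frame_bytes: int = FRAME_BYTES,
    pad: bytes = ULAW_SILENCE,
) -> Iterator[bytes]:
    """Regroup arbitrary-size chunks into fixed-size frames.

    Hypothetical helper; the last frame is padded with silence if short.
    """
    pending = bytearray()
    for chunk in chunks:
        pending.extend(chunk)
        while len(pending) >= frame_bytes:
            yield bytes(pending[:frame_bytes])
            del pending[:frame_bytes]
    if pending:
        yield bytes(pending) + pad * (frame_bytes - len(pending))
```

The iterator returned by client.text_to_speech.stream(...) could be passed in as chunks, with each frame handed to whatever sends audio to the call leg.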
For text that arrives incrementally (LLM tokens), use the realtime WebSocket instead.