Streaming
Two ways to get audio sooner: HTTP streaming returns a clip as chunked bytes so playback starts after the first sentence, and the realtime WebSocket lets you feed text as it's produced and receive audio frames live.
HTTP streaming
POST /v1/text-to-speech/{voice_id}/stream returns the same audio as the
one-shot endpoint, but chunked — time-to-first-byte is low because generation and
delivery overlap. A streamed mp3/opus response is a single valid container
(one continuous encoder for the request).
```python
from zeli_tts import ZeliSpeech

client = ZeliSpeech(api_key="sk-zeli-...", base_url="https://voice.your-domain.com")

audio = client.text_to_speech.stream(
    voice_id="zeli-voice-1",
    text="First audio arrives after the first sentence, not the whole passage.",
    output_format="mp3_44100_128",
)

with open("stream.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)  # bytes as they generate
```

```shell
curl -N -X POST \
  "https://voice.your-domain.com/v1/text-to-speech/zeli-voice-1/stream?output_format=pcm_24000" \
  -H "Authorization: Bearer sk-zeli-..." -H "Content-Type: application/json" \
  -d '{"text":"Low latency with raw PCM."}' --output stream.pcm
```

Streaming decouples generation from client-paced delivery: a producer fills an
in-memory buffer while a slow client drains it lock-free. A stalled client
can't hold the batch-1 engine lock and block other requests.
ZELI_MAX_STREAM_CHARS bounds the buffer.
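
The decoupling described above can be illustrated with a small producer/consumer sketch. This is not the server's implementation: `fake_engine`, `start_stream`, `drain`, and the `MAX_STREAM_CHARS` value are hypothetical names chosen for illustration, and the sketch uses Python's `queue.Queue` (which locks internally) rather than a lock-free buffer. It also assumes the buffer is bounded indirectly by capping the input length. The point it shows is that the engine lock is held only while generating, never while a client reads.

```python
import queue
import threading

MAX_STREAM_CHARS = 5000         # hypothetical stand-in for ZELI_MAX_STREAM_CHARS
engine_lock = threading.Lock()  # stand-in for the batch-1 engine lock


def fake_engine(text):
    """Hypothetical engine: yields one fake audio chunk per word."""
    for word in text.split():
        yield word.encode()


def start_stream(text):
    """Start generation in the background; return (buffer, producer thread)."""
    if len(text) > MAX_STREAM_CHARS:
        raise ValueError("text exceeds MAX_STREAM_CHARS")
    buf = queue.Queue()  # size is bounded indirectly by the input cap above

    def produce():
        # The lock is held only while generating, never while a client reads.
        with engine_lock:
            for chunk in fake_engine(text):
                buf.put(chunk)
        buf.put(None)  # sentinel: end of stream

    producer = threading.Thread(target=produce, daemon=True)
    producer.start()
    return buf, producer


def drain(buf):
    """Client side: read chunks at the client's own pace until the sentinel."""
    while (chunk := buf.get()) is not None:
        yield chunk
```

If one caller never reads its buffer, a second `start_stream` call can still acquire the lock as soon as the first producer finishes generating, because generation never waits on the reader.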
Realtime input streaming (WebSocket)
When text arrives incrementally — for example straight from an LLM — open a
WebSocket to /v1/text-to-speech/{voice_id}/stream-input and push text as you
get it. The server buffers by a chunk_length_schedule and synthesizes at
sentence boundaries, so audio starts after the first short sentence.
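
To make the buffering idea concrete, here is a rough client-side model of schedule-driven segmentation. The exact server rules are not specified on this page, so everything here is an assumption: the function name `segments`, the default schedule values, the rule that the n-th segment is released once the buffer holds at least `schedule[n]` characters and contains a sentence end, and the simple `.`/`!`/`?` boundary test.

```python
import re

# Assumed boundary rule: ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?](?=\s)")


def segments(pieces, schedule=(10, 40, 80)):
    """Yield text segments cut at sentence boundaries.

    The n-th segment is released once the buffer holds at least
    schedule[n] characters (the last value repeats) and contains a
    sentence end. Remaining text is released when input ends.
    """
    buf = ""
    released = 0
    for piece in pieces:
        buf += piece
        while True:
            threshold = schedule[min(released, len(schedule) - 1)]
            if len(buf) < threshold:
                break
            ends = [m.end() for m in _SENTENCE_END.finditer(buf)]
            if not ends:
                break
            cut = ends[-1]  # cut after the last complete sentence
            yield buf[:cut].strip()
            buf = buf[cut:]
            released += 1
    if buf.strip():
        yield buf.strip()  # end of input flushes the remainder
```

A small first threshold is what lets audio start after one short sentence; later thresholds can grow so that subsequent segments carry more context.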
The protocol follows mainstream realtime-TTS conventions: a BOS message, then
text messages, flush to force generation, and {"text":""} to end. Audio comes
back as base64 frames followed by a final {"isFinal":true}.
See the full Realtime WebSocket reference for message shapes and auth.
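
The message flow can be sketched without a network connection. Only `{"text":""}` and `{"isFinal":true}` are taken from this page; the BOS shape `{"text": " "}`, the `flush` field, and the `audio` key are assumptions borrowed from common realtime-TTS conventions, so check them against the Realtime WebSocket reference before relying on them. The helper names `client_messages` and `collect_audio` are hypothetical.

```python
import base64
import json


def client_messages(text_pieces, flush=True):
    """Yield the JSON messages a client would send, in order."""
    yield json.dumps({"text": " "})  # BOS message (assumed shape)
    for piece in text_pieces:
        yield json.dumps({"text": piece})  # incremental text
    if flush:
        # Force generation of whatever is still buffered (assumed shape).
        yield json.dumps({"text": " ", "flush": True})
    yield json.dumps({"text": ""})  # end of input


def collect_audio(frames):
    """Decode base64 audio frames until the final marker arrives."""
    audio = bytearray()
    for raw in frames:
        msg = json.loads(raw)
        if msg.get("audio"):  # assumed key for the base64 payload
            audio += base64.b64decode(msg["audio"])
        if msg.get("isFinal"):
            break
    return bytes(audio)
```

In a real client, each string from `client_messages` would be sent over the socket as text arrives from the LLM, and `collect_audio` would read from the socket's incoming messages instead of a list.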
Character timestamps
The .../with-timestamps and .../stream/with-timestamps endpoints return the
clip plus a character alignment.
The Turbo engine has no native character timing. HTTP
timestamp endpoints return a structurally exact but approximate
alignment — the real characters, with times spread evenly across each segment's
measured duration. On the realtime WebSocket, alignment /
normalizedAlignment are sent as null on every frame.
For lip-sync, drive the animation from the audio signal itself rather than from
these approximate timestamps. True per-character timing is a Phase-2 item.
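
The "spread evenly" alignment can be reproduced in a few lines, which helps when judging how far to trust it. This is a sketch of the idea only: the function name `even_alignment` and the output keys are assumptions, not the documented response schema.

```python
def even_alignment(text, duration_s, offset_s=0.0):
    """Spread the characters of one segment evenly across its duration.

    The characters are exact; the times are approximate because each
    character gets the same share of the segment's measured duration.
    """
    n = len(text)
    step = duration_s / n if n else 0.0
    return {
        "characters": list(text),
        "character_start_times_seconds": [offset_s + i * step for i in range(n)],
        "character_end_times_seconds": [offset_s + (i + 1) * step for i in range(n)],
    }
```

For a multi-segment clip, `offset_s` would be the sum of the durations of the preceding segments. Because every character gets an equal slice, a long vowel and a silent letter receive the same duration, which is why these times are unsuitable for precise lip-sync.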
Which one to use
| Need | Use |
|---|---|
| A finished file, simplest code | POST /v1/text-to-speech/{id} |
| Low TTFB for a known block of text | POST /.../stream |
| Text produced incrementally (LLM tokens) | WSS /.../stream-input |
| Telephony (Twilio Media Streams) | /stream with output_format=ulaw_8000 |