Streaming

Two ways to get audio sooner: HTTP streaming returns a clip as chunked bytes so playback starts after the first sentence, and the realtime WebSocket lets you feed text as it's produced and receive audio frames live.

HTTP streaming

POST /v1/text-to-speech/{voice_id}/stream returns the same audio as the one-shot endpoint, but chunked — time-to-first-byte is low because generation and delivery overlap. A streamed mp3/opus response is a single valid container (one continuous encoder for the request).

from zeli_tts import ZeliSpeech
 
client = ZeliSpeech(api_key="sk-zeli-...", base_url="https://voice.your-domain.com")
 
audio = client.text_to_speech.stream(
    voice_id="zeli-voice-1",
    text="First audio arrives after the first sentence, not the whole passage.",
    output_format="mp3_44100_128",
)
with open("stream.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)          # bytes as they generate

curl -N -X POST \
  "https://voice.your-domain.com/v1/text-to-speech/zeli-voice-1/stream?output_format=pcm_24000" \
  -H "Authorization: Bearer sk-zeli-..." -H "Content-Type: application/json" \
  -d '{"text":"Low latency with raw PCM."}' --output stream.pcm
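
To check that generation and delivery really overlap, one option is to time the first chunk separately from the whole response. The wrapper below is a minimal sketch that works on any iterator of byte chunks, such as the object returned by `client.text_to_speech.stream(...)` above; `timed_chunks` and the `stats` keys are hypothetical names, not part of the SDK.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def timed_chunks(
    chunks: Iterable[bytes],
    stats: Dict[str, object],
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[bytes]:
    """Pass chunks through unchanged while recording time-to-first-chunk and size.

    The timer starts when iteration begins (generators run lazily), so start
    consuming right after issuing the request.
    """
    start = clock()
    stats["ttfb_s"] = None   # seconds until the first chunk arrived
    stats["bytes"] = 0       # total bytes seen so far
    for chunk in chunks:
        if stats["ttfb_s"] is None:
            stats["ttfb_s"] = clock() - start
        stats["bytes"] += len(chunk)
        yield chunk
    stats["total_s"] = clock() - start  # set only after the stream is exhausted
```

Wrapping the stream as `for chunk in timed_chunks(audio, stats): f.write(chunk)` leaves the file output unchanged; if streaming is working as described, `stats["ttfb_s"]` should be noticeably smaller than `stats["total_s"]` for long passages.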

Doesn't starve the box

Streaming decouples generation from client-paced delivery: a producer fills an in-memory buffer while a slow client drains it lock-free. A stalled client can't hold the batch-1 engine lock and block other requests. ZELI_MAX_STREAM_CHARS bounds the buffer.
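
The decoupling described above can be illustrated with a small producer/consumer sketch. This is not the server's implementation: `fake_engine`, `produce`, `drain`, and the `MAX_STREAM_CHARS` value are hypothetical, the sketch assumes the limit is applied to the input text length, and it uses `queue.Queue` for simplicity without making any claim about lock-freedom.

```python
import queue
import threading

MAX_STREAM_CHARS = 5000         # hypothetical value standing in for ZELI_MAX_STREAM_CHARS
engine_lock = threading.Lock()  # stands in for the batch-1 engine lock


def fake_engine(text):
    """Placeholder synthesizer: yields one fake audio chunk per word."""
    for word in text.split():
        yield word.encode("utf-8")


def produce(text, buffer):
    """Generate into the buffer, holding the engine only while generating."""
    if len(text) > MAX_STREAM_CHARS:
        raise ValueError("text exceeds MAX_STREAM_CHARS")
    with engine_lock:
        for chunk in fake_engine(text):
            buffer.put(chunk)   # unbounded queue: put() never waits on the client
    buffer.put(None)            # sentinel: generation finished, engine released


def drain(buffer):
    """Client-paced delivery: reads from the buffer, never touches the engine."""
    while (chunk := buffer.get()) is not None:
        yield chunk
```

After `produce("one two three", buffer)` returns, `engine_lock` is free even if nothing has called `drain(buffer)` yet, which is the property that keeps a stalled client from blocking other requests. Capping the input length keeps the amount of buffered audio bounded.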

Realtime input streaming (WebSocket)

When text arrives incrementally — for example straight from an LLM — open a WebSocket to /v1/text-to-speech/{voice_id}/stream-input and push text as you get it. The server buffers by a chunk_length_schedule and synthesizes at sentence boundaries, so audio starts after the first short sentence.

The protocol follows mainstream realtime-TTS conventions: a BOS message, then text messages, flush to force generation, and {"text":""} to end. Audio comes back as base64 frames followed by a final {"isFinal":true}.

See the full Realtime WebSocket reference for message shapes and auth.
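
The message sequence described above can be sketched without a live connection. Only `{"text":""}` and `{"isFinal":true}` are taken from this page; the BOS shape (`{"text": " "}`), the `flush` field, and the `audio` field name are assumptions borrowed from common realtime-TTS conventions, so confirm them against the WebSocket reference. Both function names are hypothetical.

```python
import base64
import json


def outbound_messages(text_pieces):
    """Yield the JSON messages a client would send, in order."""
    yield json.dumps({"text": " "})                 # BOS (assumed shape)
    for piece in text_pieces:
        yield json.dumps({"text": piece})           # incremental text, e.g. LLM tokens
    yield json.dumps({"text": " ", "flush": True})  # force generation (assumed shape)
    yield json.dumps({"text": ""})                  # end of input


def collect_audio(frames):
    """Concatenate base64 audio from server frames until the final marker."""
    audio = bytearray()
    for raw in frames:
        message = json.loads(raw)
        if message.get("isFinal"):
            break
        if message.get("audio"):                    # "audio" field name is an assumption
            audio += base64.b64decode(message["audio"])
    return bytes(audio)
```

With a real WebSocket client, each string from `outbound_messages` would be sent as a text frame, and the received frames would be passed to `collect_audio` or decoded and played one at a time for lower latency.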

Character timestamps

The .../with-timestamps and .../stream/with-timestamps endpoints return the clip plus a character alignment.

Alignment is approximate

The Turbo engine has no native character timing. HTTP timestamp endpoints return a structurally exact but approximate alignment: the real characters, with times spread evenly across each segment's measured duration. On the realtime WebSocket, alignment and normalizedAlignment are sent as null on every frame. For lip-sync, drive the animation from the audio itself rather than from these timestamps. True per-character timing is a Phase-2 item.
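
To make "spread evenly" concrete, here is a minimal sketch of how such an alignment could be computed for one segment. The function name and the output key names are assumptions chosen for illustration; the actual response shape may differ.

```python
def even_alignment(text, start_s, duration_s):
    """Give every character an equal slice of the segment's measured duration."""
    count = len(text)
    if count == 0:
        return {
            "characters": [],
            "character_start_times_seconds": [],
            "character_end_times_seconds": [],
        }
    step = duration_s / count  # identical for every character, whatever it is
    return {
        "characters": list(text),
        "character_start_times_seconds": [start_s + i * step for i in range(count)],
        "character_end_times_seconds": [start_s + (i + 1) * step for i in range(count)],
    }
```

For example, `even_alignment("abcd", 0.0, 2.0)` assigns each character a 0.5 s slot. A space and a long vowel receive the same duration, which is why this alignment suits rough highlighting but not frame-accurate lip-sync.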

Which one to use

| Need | Use |
| --- | --- |
| A finished file, simplest code | `POST /v1/text-to-speech/{voice_id}` |
| Low TTFB for a known block of text | `POST /.../stream` |
| Text produced incrementally (LLM tokens) | `WSS /.../stream-input` |
| Telephony (Twilio Media Streams) | `/stream` with `output_format=ulaw_8000` |
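
For the telephony case, the streamed `ulaw_8000` bytes still have to be wrapped in the messages the phone bridge expects. The sketch below assumes the Twilio Media Streams outbound `media` message shape (`event`, `streamSid`, base64 `payload`); verify it against Twilio's reference before relying on it. The 20 ms frame size is a common choice rather than a requirement, and `twilio_media_messages` is a hypothetical helper.

```python
import base64
import json

FRAME_BYTES = 160  # 20 ms of 8 kHz mu-law audio at 1 byte per sample


def twilio_media_messages(ulaw_audio, stream_sid):
    """Split raw mu-law bytes into frames and wrap each as a JSON media message."""
    for offset in range(0, len(ulaw_audio), FRAME_BYTES):
        frame = ulaw_audio[offset:offset + FRAME_BYTES]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(frame).decode("ascii")},
        })
```

In practice the bytes would come from the `/stream` response as they arrive, so a real bridge would buffer partial frames across chunks instead of holding the whole clip in memory.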
