Zeligate Voice Developer docs v1.3.0

Streaming

Two ways to get audio sooner: HTTP streaming returns a clip as chunked bytes so playback starts after the first sentence, and the realtime WebSocket lets you feed text as it's produced and receive audio frames live.

HTTP streaming

POST /v1/text-to-speech/{voice_id}/stream returns the same audio as the one-shot endpoint, but in chunks. Time to first byte is low because generation and delivery overlap. A streamed mp3 or opus response is a single valid container, because one continuous encoder serves the whole request.

from zeli_tts import ZeliSpeech

# Point the client at your own deployment.
client = ZeliSpeech(api_key="sk-zeli-...", base_url="https://voice.your-domain.com")

# Iterating over the result yields audio chunks instead of one finished clip.
audio = client.text_to_speech.stream(
    voice_id="zeli-voice-1",
    text="First audio arrives after the first sentence, not the whole passage.",
    output_format="mp3_44100_128",
)

# Write each chunk as it arrives; together the chunks form one valid mp3 file.
with open("stream.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
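
To check the low time-to-first-byte claim on your own deployment, you can time the chunks as they arrive. The sketch below is a generic helper and not part of the SDK: timed_chunks and fake_stream are hypothetical names, and fake_stream only stands in for a streamed response so the example runs without a server.

```python
import time
from typing import Iterable, Iterator, Tuple


def timed_chunks(chunks: Iterable[bytes]) -> Iterator[Tuple[float, bytes]]:
    """Yield (seconds_since_iteration_started, chunk) for each chunk."""
    start = time.monotonic()
    for chunk in chunks:
        yield time.monotonic() - start, chunk


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a streamed response: three chunks, 10 ms apart."""
    for part in (b"aaa", b"bbb", b"ccc"):
        time.sleep(0.01)
        yield part


timings = list(timed_chunks(fake_stream()))
ttfb = timings[0][0]       # time until the first chunk arrived
total = timings[-1][0]     # time until the last chunk arrived
received = b"".join(chunk for _, chunk in timings)
print(f"TTFB {ttfb * 1000:.0f} ms, total {total * 1000:.0f} ms, {len(received)} bytes")
```

With a real stream you would pass the iterator returned by the streaming call in place of fake_stream(). The timer starts when iteration begins, so whether it includes the request round trip depends on when the client actually sends the request.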

Doesn't starve the box

Streaming decouples generation from client-paced delivery: a producer fills an in-memory buffer, and the client drains that buffer at its own pace without holding the engine lock. A stalled client therefore can't hold the batch-1 engine lock and block other requests. ZELI_MAX_STREAM_CHARS bounds the buffer.
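
The decoupling can be illustrated with a small producer/consumer sketch. This is not the server's code. It assumes the engine lock is held only while chunks are generated, and that ZELI_MAX_STREAM_CHARS works by capping the input text so the buffer cannot grow without limit. MAX_STREAM_CHARS, engine_lock, start_stream, and drain are hypothetical names, and "synthesis" is faked by encoding text.

```python
import queue
import threading
import time
from typing import List

MAX_STREAM_CHARS = 200          # hypothetical cap, standing in for ZELI_MAX_STREAM_CHARS
engine_lock = threading.Lock()  # stand-in for the single (batch-1) engine


def produce(text: str, buffer: queue.Queue) -> None:
    """Generate chunks under the engine lock without ever waiting on the client."""
    with engine_lock:
        for sentence in text.split(". "):
            buffer.put(sentence.encode())   # unbounded queue, so put() never blocks
    buffer.put(None)                        # sentinel: generation finished


def start_stream(text: str) -> queue.Queue:
    """Reject oversized input, then generate in a background thread."""
    if len(text) > MAX_STREAM_CHARS:
        raise ValueError("text exceeds MAX_STREAM_CHARS")
    buffer: queue.Queue = queue.Queue()
    threading.Thread(target=produce, args=(text, buffer), daemon=True).start()
    return buffer


def drain(buffer: queue.Queue, delay: float) -> List[bytes]:
    """Client-paced delivery: reads the buffer and never touches the engine lock."""
    chunks = []
    while True:
        chunk = buffer.get()
        if chunk is None:
            return chunks
        time.sleep(delay)                   # simulate a slow network client
        chunks.append(chunk)


stalled = start_stream("One. Two. Three")   # nobody reads this buffer yet
other = start_stream("Hello there")
other_chunks = drain(other, 0.0)            # completes while the first client is stalled
stalled_chunks = drain(stalled, 0.01)       # the slow client catches up later
print(other_chunks, stalled_chunks)
```

The second request finishes even though the first client has not read a single chunk, because the producer releases the lock as soon as generation ends rather than when delivery ends.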

Realtime input streaming (WebSocket)

When text arrives incrementally — for example straight from an LLM — open a WebSocket to /v1/text-to-speech/{voice_id}/stream-input and push text as you get it. The server buffers by a chunk_length_schedule and synthesizes at sentence boundaries, so audio starts after the first short sentence.

The protocol follows mainstream realtime-TTS conventions: a BOS message, then text messages, flush to force generation, and {"text":""} to end. Audio comes back as base64 frames followed by a final {"isFinal":true}.

See the full Realtime WebSocket reference for message shapes and auth.
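
As a rough illustration of the message flow, the sketch below builds the outgoing messages and decodes incoming frames without opening a socket. Only {"text":""}, {"isFinal":true}, and the alignment / normalizedAlignment fields come from this page. The BOS shape ({"text": " "}), the flush field, and the "audio" key are assumptions borrowed from common realtime-TTS APIs, so check them against the reference before relying on them.

```python
import base64
import json
from typing import Iterable, Iterator, Optional


def outgoing_messages(text_pieces: Iterable[str]) -> Iterator[str]:
    """Yield the JSON messages for one session: BOS, text, flush, end."""
    yield json.dumps({"text": " "})                   # BOS (assumed shape)
    for piece in text_pieces:
        yield json.dumps({"text": piece})             # text as it is produced
    yield json.dumps({"text": " ", "flush": True})    # force generation (assumed shape)
    yield json.dumps({"text": ""})                    # end of input (from this page)


def decode_frame(raw: str) -> Optional[bytes]:
    """Return the audio bytes in a frame, or None once the final frame arrives."""
    frame = json.loads(raw)
    if frame.get("isFinal"):
        return None
    audio = frame.get("audio")                        # "audio" key is an assumption
    return base64.b64decode(audio) if audio else b""


messages = list(outgoing_messages(["Hello, ", "world."]))

# Simulated server frames: one audio frame, then the final marker.
frames = [
    json.dumps({
        "audio": base64.b64encode(b"\x01\x02").decode(),
        "alignment": None,
        "normalizedAlignment": None,
    }),
    json.dumps({"isFinal": True}),
]

audio = b""
for raw in frames:
    decoded = decode_frame(raw)
    if decoded is None:
        break
    audio += decoded
```

In a real client, each string from outgoing_messages would be sent over the WebSocket as the LLM produces text, and decode_frame would run on every message received.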

Character timestamps

The .../with-timestamps and .../stream/with-timestamps endpoints return the clip plus a character alignment.

Alignment is approximate

The Turbo engine has no native character timing. HTTP timestamp endpoints return a structurally exact but approximate alignment — the real characters, with times spread evenly across each segment's measured duration. On the realtime WebSocket, alignment / normalizedAlignment are sent as null on every frame. For audio-driven lip-sync, drive from the audio itself. True per-character timing is a Phase-2 item.
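
The even spreading described above can be sketched as follows. This is an illustration of the idea rather than the server's implementation. The function name even_alignment is hypothetical, and the field names mirror a common TTS timestamp response shape, which is an assumption here.

```python
from typing import Dict, List


def even_alignment(text: str, start: float, duration: float) -> Dict[str, List]:
    """Spread character start and end times evenly across one segment."""
    n = len(text)
    if n == 0:
        return {
            "characters": [],
            "character_start_times_seconds": [],
            "character_end_times_seconds": [],
        }
    step = duration / n   # every character gets the same share of the segment
    return {
        "characters": list(text),
        "character_start_times_seconds": [start + i * step for i in range(n)],
        "character_end_times_seconds": [start + (i + 1) * step for i in range(n)],
    }


alignment = even_alignment("abcd", start=0.0, duration=1.0)
print(alignment["character_start_times_seconds"])   # [0.0, 0.25, 0.5, 0.75]
print(alignment["character_end_times_seconds"])     # [0.25, 0.5, 0.75, 1.0]
```

The characters are exact and the segment boundaries are measured, but every time inside a segment is interpolated. A long vowel and a silent letter get the same width, which is why this alignment suits captions better than lip-sync.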

Which one to use

| Need | Use |
| --- | --- |
| A finished file, simplest code | POST /v1/text-to-speech/{id} |
| Low TTFB for a known block of text | POST /.../stream |
| Text produced incrementally (LLM tokens) | WSS /.../stream-input |
| Telephony (Twilio Media Streams) | /stream with output_format=ulaw_8000 |