Zeligate Voice Developer docs
v1.3.0

Realtime WebSocket

Feed text as it's produced — for example straight from an LLM — and receive audio frames live. The protocol mirrors the ZeliSpeech realtime WebSocket: camelCase keys, base64 audio, and an isFinal terminator.

WSS /v1/text-to-speech/{voice_id}/stream-input?output_format=mp3_44100_128

Protocol

The exchange is: one BOS (beginning-of-stream) message, then text messages as they arrive, optional flush / try_trigger_generation triggers, and a final {"text":""} to end.

1. Send the BOS message

{
  "text": " ",
  "voice_settings": { "stability": 0.5, "style": 0.3 },
  "generation_config": { "chunk_length_schedule": [120, 160, 250, 290] },
  "authorization": "Bearer sk-zeli-..."
}

text · string · Required

A single space " " opens the stream.

voice_settings · object · Optional

Same mapping as HTTP. See Voice settings.

generation_config.chunk_length_schedule · int[] · Optional

How many characters to buffer before triggering generation at each step. Smaller first values start audio sooner.

authorization · string · Optional

A Bearer token — one of two ways to authenticate; see Authentication.
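
The BOS fields above can be assembled with a small helper. This is a minimal sketch: the function name `build_bos` and its arguments are illustrative and not part of the API, and it only emits the keys shown in the BOS message above. Optional fields are left out entirely when not supplied.

```python
import json


def build_bos(token=None, voice_settings=None, chunk_length_schedule=None):
    """Return the BOS message as a JSON string (hypothetical helper)."""
    msg = {"text": " "}  # a single space opens the stream
    if voice_settings is not None:
        msg["voice_settings"] = voice_settings
    if chunk_length_schedule is not None:
        msg["generation_config"] = {
            "chunk_length_schedule": list(chunk_length_schedule)
        }
    if token is not None:
        # Same "Bearer <token>" shape as the BOS example above.
        msg["authorization"] = f"Bearer {token}"
    return json.dumps(msg)


# Example: a BOS message with the values used elsewhere on this page.
bos = build_bos(
    token="sk-zeli-...",
    voice_settings={"stability": 0.5, "style": 0.3},
    chunk_length_schedule=[120, 160, 250, 290],
)
```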

2. Stream text

{ "text": "Hello there. " }

End each chunk with a trailing space so that words are not joined across message boundaries. Control generation timing with:

flush · boolean · Optional

{"flush": true} forces generation of everything buffered so far.

try_trigger_generation · boolean · Optional

{"try_trigger_generation": true} triggers at the next boundary.
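
A minimal sketch of building text messages with these flags. The helper name `text_message` is hypothetical. It also assumes that `flush` and `try_trigger_generation` may travel in the same object as `text`; this page only shows them as standalone objects, so verify that against your server before relying on it.

```python
import json


def text_message(text, flush=False, try_trigger_generation=False):
    """Build one text message as a JSON string (hypothetical helper)."""
    if not text:
        # An empty "text" ends the stream, so refuse to send it by accident.
        raise ValueError("empty text is reserved for ending the stream")
    if not text.endswith(" "):
        text += " "  # trailing space keeps word boundaries intact
    msg = {"text": text}
    # Assumption: the flags can be combined with "text" in one message.
    if flush:
        msg["flush"] = True
    if try_trigger_generation:
        msg["try_trigger_generation"] = True
    return json.dumps(msg)


# Example: force generation at the end of a sentence.
last = text_message("That is all.", flush=True)
```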

3. End the stream

{ "text": "" }
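
Because an empty `text` ends the stream, a producer that may emit empty chunks (an LLM token stream, for instance) should filter them out before sending. The sketch below orders the outgoing messages as BOS, text chunks, then the end-of-stream message. The generator name `outgoing_messages` is illustrative, not part of the API.

```python
import json


def outgoing_messages(parts, bos):
    """Yield JSON strings in protocol order: BOS, text chunks, end of stream."""
    yield json.dumps(bos)
    for part in parts:
        if part:  # skip empty chunks, which would end the stream early
            yield json.dumps({"text": part})
    yield json.dumps({"text": ""})  # explicit end of stream


# Example: the empty chunk in the middle is dropped.
messages = list(outgoing_messages(["Hello ", "", "there. "], {"text": " "}))
```

Each yielded string can be passed directly to the WebSocket client's send call.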

4. Receive audio frames

Each frame is a JSON object with base64 audio; a final frame carries isFinal: true.

{ "audio": "<base64>", "isFinal": null, "normalizedAlignment": null, "alignment": null }
{ "audio": null, "isFinal": true }

Alignment fields are null

alignment and normalizedAlignment are sent as null on every frame, because a streamed encoder can't attribute encoded bytes to characters. If you need approximate alignment, use the HTTP .../with-timestamps endpoints instead. See Character timestamps.
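
A minimal sketch of the receiving side, decoupled from the socket so it can be exercised offline. It takes any iterable of raw JSON frames shaped like the two shown above, decodes the base64 audio, and stops at the frame with `isFinal: true`. The function name `collect_audio` is illustrative.

```python
import base64
import json


def collect_audio(frames):
    """Concatenate decoded audio from raw JSON frames until isFinal is true."""
    audio = bytearray()
    for raw in frames:
        msg = json.loads(raw)
        if msg.get("audio"):  # the final frame carries "audio": null
            audio += base64.b64decode(msg["audio"])
        if msg.get("isFinal"):
            break  # ignore anything after the terminator
    return bytes(audio)


# Example with hand-made frames standing in for server output.
sample = [
    json.dumps({"audio": base64.b64encode(b"ab").decode(), "isFinal": None,
                "normalizedAlignment": None, "alignment": None}),
    json.dumps({"audio": base64.b64encode(b"cd").decode(), "isFinal": None,
                "normalizedAlignment": None, "alignment": None}),
    json.dumps({"audio": None, "isFinal": True}),
]
data = collect_audio(sample)
```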

Example

import base64
import json

import websocket  # provided by the websocket-client package

VOICE = "zeli-voice-1"
url = f"wss://voice.your-domain.com/v1/text-to-speech/{VOICE}/stream-input?output_format=mp3_44100_128"
ws = websocket.create_connection(url)

try:
    # BOS: a single space opens the stream and carries settings and auth
    ws.send(json.dumps({
        "text": " ",
        "voice_settings": {"stability": 0.5, "style": 0.3},
        "generation_config": {"chunk_length_schedule": [120, 160, 250, 290]},
        "authorization": "Bearer sk-zeli-...",
    }))

    # Stream text; each chunk ends with a trailing space
    for part in ["Hello there. ", "This streams as I type. "]:
        ws.send(json.dumps({"text": part}))

    # EOS: an empty text ends the stream
    ws.send(json.dumps({"text": ""}))

    # Decode audio frames until the isFinal terminator arrives
    with open("out.mp3", "wb") as f:
        while True:
            msg = json.loads(ws.recv())
            if msg.get("audio"):
                f.write(base64.b64decode(msg["audio"]))
            if msg.get("isFinal"):
                break
finally:
    # Close the connection even if sending or receiving fails
    ws.close()

Not yet: multi-context

The multi-context multi-stream-input variant is not implemented yet. Use one stream-input connection per synthesis.

Zeligate Voice API · self-hosted · secure data sovereignty · source