Realtime WebSocket

Feed text as it's produced — for example straight from an LLM — and receive audio frames live. The protocol mirrors the ZeliSpeech realtime WebSocket: camelCase keys, base64 audio, and an isFinal terminator.

WSS /v1/text-to-speech/{voice_id}/stream-input?output_format=mp3_44100_128
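A minimal sketch of assembling the endpoint URL, assuming the placeholder host voice.your-domain.com used in the examples on this page. The helper name build_stream_url is hypothetical and not part of any SDK.

```python
from urllib.parse import quote, urlencode


def build_stream_url(host: str, voice_id: str,
                     output_format: str = "mp3_44100_128") -> str:
    """Build the stream-input WebSocket URL for one voice (hypothetical helper)."""
    # Percent-encode the voice ID so unusual characters cannot break the path.
    path = f"/v1/text-to-speech/{quote(voice_id, safe='')}/stream-input"
    query = urlencode({"output_format": output_format})
    return f"wss://{host}{path}?{query}"


print(build_stream_url("voice.your-domain.com", "zeli-voice-1"))
```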

Protocol

The exchange is: one BOS (beginning-of-stream) message, then text messages as they arrive, optional flush / try_trigger_generation triggers, and a final {"text":""} to end.

1. Send the BOS message

{
  "text": " ",
  "voice_settings": { "stability": 0.5, "style": 0.3 },
  "generation_config": { "chunk_length_schedule": [120, 160, 250, 290] },
  "authorization": "Bearer sk-zeli-..."
}

text string Required

A single space " " opens the stream.

voice_settings object Optional

Same mapping as HTTP. See Voice settings.

generation_config.chunk_length_schedule int[] Optional

How many characters to buffer before triggering generation at each step. Smaller first values start audio sooner.

authorization string Optional

A Bearer token — one of two ways to authenticate; see Authentication.
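The fields above could be assembled as in the following sketch. build_bos is a hypothetical helper, not part of any SDK; it leaves out every optional field that was not supplied, so only the required single-space text is always present.

```python
import json
from typing import Optional, Sequence


def build_bos(voice_settings: Optional[dict] = None,
              chunk_length_schedule: Optional[Sequence[int]] = None,
              token: Optional[str] = None) -> str:
    """Serialize a BOS message, omitting optional fields that were not given."""
    message: dict = {"text": " "}  # a single space opens the stream
    if voice_settings is not None:
        message["voice_settings"] = dict(voice_settings)
    if chunk_length_schedule is not None:
        message["generation_config"] = {
            "chunk_length_schedule": list(chunk_length_schedule)
        }
    if token is not None:
        # Only needed when the connection is not authenticated another way.
        message["authorization"] = f"Bearer {token}"
    return json.dumps(message)


print(build_bos({"stability": 0.5, "style": 0.3}, [120, 160, 250, 290]))
```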

2. Stream text

{ "text": "Hello there. " }

End each chunk with a trailing space so that words are not joined across message boundaries. Control generation timing with:

flush boolean Optional

{"flush": true} forces generation of everything buffered so far.

try_trigger_generation boolean Optional

{"try_trigger_generation": true} triggers at the next boundary.
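A sketch of building text messages on the client. text_message is a hypothetical helper. It assumes the two flags may ride on the same object as text, which the field listing suggests but this page does not state outright. It expects whole words or sentences, since it appends the trailing space itself, and it rejects empty or whitespace-only text because "" ends the stream and " " opens it.

```python
import json


def text_message(text: str, flush: bool = False,
                 try_trigger_generation: bool = False) -> str:
    """Serialize one text message (hypothetical helper)."""
    if not text.strip():
        # "" would end the stream and " " is the BOS opener, so refuse both.
        raise ValueError("text must contain at least one visible character")
    if not text.endswith(" "):
        text += " "  # trailing space keeps words apart across messages
    message: dict = {"text": text}
    # Assumption: the flags are sent alongside text in the same object.
    if flush:
        message["flush"] = True
    if try_trigger_generation:
        message["try_trigger_generation"] = True
    return json.dumps(message)


print(text_message("Hello there."))
print(text_message("Read this now.", flush=True))
```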

3. End the stream

{ "text": "" }

4. Receive audio frames

Each frame is a JSON object with base64 audio; a final frame carries isFinal: true.

{ "audio": "<base64>", "isFinal": null, "normalizedAlignment": null, "alignment": null }

{ "audio": null, "isFinal": true }
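Handling the two frame shapes above might look like this sketch. parse_frame is a hypothetical helper that returns the decoded audio bytes, or None when the frame carries no audio, together with a flag for the final frame.

```python
import base64
import json
from typing import Optional, Tuple


def parse_frame(raw: str) -> Tuple[Optional[bytes], bool]:
    """Decode one server frame into (audio_bytes_or_None, is_final)."""
    frame = json.loads(raw)
    audio_b64 = frame.get("audio")
    # audio is null on the final frame, so decode only when it is present.
    audio = base64.b64decode(audio_b64) if audio_b64 else None
    # isFinal is null on ordinary frames and true only on the last one.
    return audio, frame.get("isFinal") is True


sample = json.dumps({"audio": base64.b64encode(b"abc").decode("ascii"),
                     "isFinal": None})
print(parse_frame(sample))
print(parse_frame('{"audio": null, "isFinal": true}'))
```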

Alignment fields are null

alignment and normalizedAlignment are sent as null on every frame, because a streamed encoder can't attribute encoded bytes to characters. If you need approximate alignment, use the HTTP .../with-timestamps endpoints instead. See Character timestamps.

Example

# Python, using the websocket-client package (pip install websocket-client)
import base64
import json

import websocket

VOICE = "zeli-voice-1"
url = f"wss://voice.your-domain.com/v1/text-to-speech/{VOICE}/stream-input?output_format=mp3_44100_128"
ws = websocket.create_connection(url)

try:
    # BOS: a single space opens the stream and carries settings and auth
    ws.send(json.dumps({
        "text": " ",
        "voice_settings": {"stability": 0.5, "style": 0.3},
        "generation_config": {"chunk_length_schedule": [120, 160, 250, 290]},
        "authorization": "Bearer sk-zeli-...",
    }))

    # Stream text as it becomes available
    for part in ["Hello there. ", "This streams as I type. "]:
        ws.send(json.dumps({"text": part}))

    ws.send(json.dumps({"text": ""}))    # EOS

    # Write decoded audio to disk until the final frame arrives
    with open("out.mp3", "wb") as f:
        while True:
            msg = json.loads(ws.recv())
            if msg.get("audio"):
                f.write(base64.b64decode(msg["audio"]))
            if msg.get("isFinal"):
                break
finally:
    # Close the socket even if sending or receiving fails
    ws.close()

// JavaScript, using the browser WebSocket API
const voice = "zeli-voice-1";
// The token is passed in the query string here, so the BOS message omits it.
const ws = new WebSocket(
  `wss://voice.your-domain.com/v1/text-to-speech/${voice}/stream-input` +
  `?output_format=mp3_44100_128&authorization=Bearer%20sk-zeli-...`
);

ws.onopen = () => {
  // BOS: a single space opens the stream
  ws.send(JSON.stringify({
    text: " ",
    voice_settings: { stability: 0.5, style: 0.3 },
    generation_config: { chunk_length_schedule: [120, 160, 250, 290] },
  }));
  ws.send(JSON.stringify({ text: "Hello there. " }));
  ws.send(JSON.stringify({ text: "" })); // EOS
};

ws.onmessage = (ev) => {
  const msg = JSON.parse(ev.data);
  if (msg.audio) {
    // Decode the base64 payload into raw bytes
    const bytes = Uint8Array.from(atob(msg.audio), (c) => c.charCodeAt(0));
    /* append to a MediaSource / decode / play */
  }
  // The final frame has no audio; close once it arrives
  if (msg.isFinal) ws.close();
};

Not yet: multi-context

The multi-context multi-stream-input variant is not implemented yet. Use one stream-input connection per synthesis.
