Realtime WebSocket
Feed text as it's produced, for example straight from an LLM, and receive
audio frames live. The protocol mirrors the ZeliSpeech realtime WebSocket:
snake_case keys in the messages you send, camelCase keys in the frames you
receive, base64 audio, and an isFinal terminator.
Protocol
The exchange is: one BOS (beginning-of-stream) message, then text messages as
they arrive, optional flush / try_trigger_generation triggers, and a final
{"text":""} to end.
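The message order described above can be sketched as a small generator. This is only an illustration: stream_messages and its bos argument are hypothetical names, and the trigger messages are left out for brevity.

```python
import json
from typing import Iterable, Iterator


def stream_messages(parts: Iterable[str], bos: dict) -> Iterator[str]:
    """Yield the JSON messages for one stream, in protocol order."""
    # 1. BOS: the caller's settings plus a single-space text field.
    yield json.dumps({**bos, "text": " "})
    # 2. One text message per part. Empty parts are skipped because
    #    an empty text field is the end-of-stream marker.
    for part in parts:
        if part:
            yield json.dumps({"text": part})
    # 3. EOS: an empty text field ends the stream.
    yield json.dumps({"text": ""})
```

Each yielded string would be passed to the connection's send call in turn.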
1. Send the BOS message
{
  "text": " ",
  "voice_settings": { "stability": 0.5, "style": 0.3 },
  "generation_config": { "chunk_length_schedule": [120, 160, 250, 290] },
  "authorization": "Bearer sk-zeli-..."
}

text: A single space " " opens the stream.
voice_settings: Same mapping as HTTP. See Voice settings.
generation_config.chunk_length_schedule: How many characters to buffer before triggering generation at each step. Smaller first values start audio sooner.
authorization: A Bearer token, one of two ways to authenticate; see Authentication.
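To get a feel for why smaller first values in chunk_length_schedule start audio sooner, here is a toy client-side simulation. It is not the server's algorithm. It assumes that the i-th generation fires as soon as the buffer holds the i-th schedule value in characters, that the last value repeats once the schedule is exhausted, and that leftover text is generated at end of stream. simulate_schedule is a hypothetical name.

```python
from typing import List, Sequence


def simulate_schedule(text: str, schedule: Sequence[int]) -> List[str]:
    """Split text into the chunks a buffer-then-generate loop would emit.

    Assumed rule (not confirmed by the API docs): chunk i is emitted once
    the buffer reaches schedule[i] characters; the last value repeats.
    """
    chunks: List[str] = []
    buffer = ""
    step = 0
    for ch in text:
        buffer += ch
        threshold = schedule[min(step, len(schedule) - 1)]
        if len(buffer) >= threshold:
            chunks.append(buffer)
            buffer = ""
            step += 1
    if buffer:
        # Assumed: whatever remains is generated when the stream ends.
        chunks.append(buffer)
    return chunks
```

Under these assumptions, a schedule starting at 50 produces its first chunk after 50 characters, while one starting at 120 waits for 120, which is the sense in which the first value controls time to first audio.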
2. Stream text
{ "text": "Hello there. " }

A trailing space helps word boundaries. Control generation timing with:
{"flush": true} forces generation of everything buffered so far.
{"try_trigger_generation": true} triggers at the next boundary.
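One way to use the flush trigger is to force generation at the end of each sentence when relaying LLM tokens. The sketch below only builds the messages. It sends the trigger as a standalone message, as written above; whether a trigger can share a message with a text key is not stated here. text_messages and SENTENCE_END are hypothetical names.

```python
import json
from typing import Iterable, Iterator

# Characters treated as sentence endings (an assumption for this sketch).
SENTENCE_END = (".", "!", "?")


def text_messages(tokens: Iterable[str]) -> Iterator[str]:
    """Yield a text message per token, plus a flush after each sentence."""
    for token in tokens:
        if not token:
            # Never forward an empty string: it would end the stream.
            continue
        yield json.dumps({"text": token})
        if token.rstrip().endswith(SENTENCE_END):
            # Force generation of everything buffered so far.
            yield json.dumps({"flush": True})
```

Flushing on every sentence trades some prosody context for lower latency, so it may be worth flushing less often for long passages.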
3. End the stream
{ "text": "" }

4. Receive audio frames
Each frame is a JSON object with base64 audio; a final frame carries
isFinal: true.
{ "audio": "<base64>", "isFinal": null, "normalizedAlignment": null, "alignment": null }
{ "audio": null, "isFinal": true }

alignment and normalizedAlignment are sent as null on every frame, because a
streamed encoder can't attribute encoded bytes to characters. If you need
approximate alignment, use the HTTP .../with-timestamps endpoints instead. See
Character timestamps.
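Frame handling can be isolated from the socket, which makes it easy to test. A minimal sketch, assuming frames arrive as JSON strings shaped like the two examples above; collect_audio is a hypothetical name.

```python
import base64
import json
from typing import Iterable


def collect_audio(frames: Iterable[str]) -> bytes:
    """Concatenate decoded audio from frames until isFinal is true."""
    audio = bytearray()
    for raw in frames:
        msg = json.loads(raw)
        # The final frame carries "audio": null, so guard before decoding.
        if msg.get("audio"):
            audio += base64.b64decode(msg["audio"])
        # isFinal is null on ordinary frames and true on the last one.
        if msg.get("isFinal"):
            break
    return bytes(audio)
```

Anything that arrives after the final frame is ignored, and the alignment fields are not read since they are always null on this endpoint.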
Example
import base64
import json

import websocket  # provided by the websocket-client package

VOICE = "zeli-voice-1"
url = f"wss://voice.your-domain.com/v1/text-to-speech/{VOICE}/stream-input?output_format=mp3_44100_128"
ws = websocket.create_connection(url)

# BOS
ws.send(json.dumps({
    "text": " ",
    "voice_settings": {"stability": 0.5, "style": 0.3},
    "generation_config": {"chunk_length_schedule": [120, 160, 250, 290]},
    "authorization": "Bearer sk-zeli-...",
}))

# Stream text as it becomes available
for part in ["Hello there. ", "This streams as I type. "]:
    ws.send(json.dumps({"text": part}))

ws.send(json.dumps({"text": ""}))  # EOS

# Write decoded audio until the final frame arrives
with open("out.mp3", "wb") as f:
    while True:
        msg = json.loads(ws.recv())
        if msg.get("audio"):
            f.write(base64.b64decode(msg["audio"]))
        if msg.get("isFinal"):
            break
ws.close()

const voice = "zeli-voice-1";
// Here the token is passed in the authorization query parameter
const ws = new WebSocket(
  `wss://voice.your-domain.com/v1/text-to-speech/${voice}/stream-input` +
  `?output_format=mp3_44100_128&authorization=Bearer%20sk-zeli-...`
);

ws.onopen = () => {
  // BOS
  ws.send(JSON.stringify({
    text: " ",
    voice_settings: { stability: 0.5, style: 0.3 },
    generation_config: { chunk_length_schedule: [120, 160, 250, 290] },
  }));
  ws.send(JSON.stringify({ text: "Hello there. " }));
  ws.send(JSON.stringify({ text: "" })); // EOS
};

ws.onmessage = (ev) => {
  const msg = JSON.parse(ev.data);
  if (msg.audio) {
    // Decode the base64 payload into raw bytes
    const bytes = Uint8Array.from(atob(msg.audio), (c) => c.charCodeAt(0));
    /* append to a MediaSource / decode / play */
  }
  if (msg.isFinal) ws.close();
};

The multi-context multi-stream-input variant is not implemented yet. Use one
stream-input connection per synthesis.
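In practice, one connection per synthesis means opening, draining, and closing a socket for every utterance. A minimal sketch, assuming a hypothetical connect factory that returns an object with send, recv, and close (for example a wrapper around websocket.create_connection) and that takes care of authentication itself; synthesize is also a hypothetical name.

```python
import base64
import json
from typing import Callable, Iterable


def synthesize(connect: Callable[[], object], parts: Iterable[str]) -> bytes:
    """Run one complete stream on its own connection and return the audio."""
    ws = connect()  # a fresh connection for this synthesis only
    try:
        ws.send(json.dumps({"text": " "}))  # BOS
        for part in parts:
            if part:  # an empty string would end the stream early
                ws.send(json.dumps({"text": part}))
        ws.send(json.dumps({"text": ""}))  # EOS
        audio = bytearray()
        while True:
            msg = json.loads(ws.recv())
            if msg.get("audio"):
                audio += base64.b64decode(msg["audio"])
            if msg.get("isFinal"):
                return bytes(audio)
    finally:
        # Close even on errors so connections are never reused or leaked.
        ws.close()
```

Calling synthesize once per utterance keeps each stream isolated, at the cost of a new handshake each time.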