Text to speech
The core capability: send text, get natural, expressive speech back. One request, one clip — or stream it as it generates. The engine renders 24 kHz mono audio and transcodes to whatever output format you ask for.
The request
Every synthesis call is a POST to /v1/text-to-speech/{voice_id} with a JSON
body. The only required field is text.
from zeli_tts import ZeliSpeech

client = ZeliSpeech(api_key="sk-zeli-...", base_url="https://voice.your-domain.com")
audio = client.text_to_speech.convert(
    voice_id="zeli-voice-3",
    text="This whole passage is spoken as one natural take, "
         "so the pacing and intonation flow from start to finish.",
    output_format="mp3_44100_128",
)
with open("out.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)

curl -X POST \
  "https://voice.your-domain.com/v1/text-to-speech/zeli-voice-3?output_format=mp3_44100_128" \
  -H "Authorization: Bearer sk-zeli-..." -H "Content-Type: application/json" \
  -d '{"text":"A finished clip in one call."}' --output out.mp3

See the full Create speech reference for every field and response detail.
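If the SDK isn't available, the same request can be assembled by hand. The sketch below uses only Python's standard library and mirrors the curl call in this section: a POST to /v1/text-to-speech/{voice_id}, output_format as a query parameter, a bearer token, and a JSON body containing text. The helper name build_tts_request is hypothetical, and nothing is sent until the request is passed to urlopen.

```python
import json
import urllib.parse
import urllib.request


def build_tts_request(base_url, voice_id, text, api_key, output_format=None):
    """Assemble (but do not send) a synthesis request.

    Hypothetical helper: the path, query parameter, and headers follow the
    curl example in this section; nothing else about the API is assumed.
    """
    voice = urllib.parse.quote(voice_id, safe="")
    url = base_url.rstrip("/") + "/v1/text-to-speech/" + voice
    if output_format:
        url += "?" + urllib.parse.urlencode({"output_format": output_format})
    # text is the only required body field
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


req = build_tts_request(
    "https://voice.your-domain.com",
    "zeli-voice-3",
    "A finished clip in one call.",
    "sk-zeli-...",
    output_format="mp3_44100_128",
)
print(req.get_method(), req.full_url)
```

Passing req to urllib.request.urlopen and reading the response should yield the audio bytes, in the same way curl writes them to out.mp3.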
Models
model_id is accepted but never selects an engine: every request runs on the
single Zeli Turbo engine. The /v1/models list exists only so that a client
with a hard-coded model_id (like zeli-turbo) resolves cleanly, so you never
have to change your model string when migrating.
Send whatever model_id your integration already uses.
It's recorded and echoed back, but it doesn't change the engine. See
Models.
Writing text that sounds human
The engine speaks at its native rate; punctuation and structure steer pacing and intonation. A few reliable levers:
| You write | You get |
|---|---|
| , | a short breath / beat |
| . | a full stop, falling tone |
| ... | hesitation, trailing off |
| — (em dash) | an abrupt cut or aside |
| ? / ! | a rising / energetic tone |
| blank line | a paragraph break, spoken as its own take with a short pause |
Tips:
- Write the way people talk: contractions (it's, you're), short sentences.
- Sprinkle real fillers (well, honestly, you know) for a conversational read.
- Separate distinct points or lines of dialogue with a blank line; each is spoken as its own take.
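To preview how a script will be divided before sending it, the blank-line rule from the table can be approximated on the client side. This is only a sketch: split_into_takes is a hypothetical helper that never calls the API, and the engine's own segmentation may differ in edge cases.

```python
import re


def split_into_takes(script):
    """Split a script on blank lines, mirroring the paragraph-break rule.

    Hypothetical preview helper; the engine may segment text differently.
    """
    # One or more blank (or whitespace-only) lines separate takes.
    parts = re.split(r"\n\s*\n", script.strip())
    # Collapse line wraps inside a take into single spaces.
    return [" ".join(part.split()) for part in parts if part.strip()]


script = """Well, honestly... I didn't expect that.

You're kidding, right?
Tell me you're kidding."""

for number, take in enumerate(split_into_takes(script), start=1):
    print(number, take)
```

Here the two lines of the second paragraph stay together as one take, because only a blank line counts as a break.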
Expressiveness
Steer the delivery with voice_settings. The ZeliSpeech stability, style, and
speed sliders map onto the Turbo delivery knobs (cfg_weight, temperature,
exaggeration, and a WSOLA time-stretch for pace). Omit voice_settings to get
the engine's lively-neutral default.
On the native /tts contract, a per-emotion library maps tones (calm, excited,
serious, sad…) to all four delivery knobs at once, so emotions sound genuinely
distinct. The voice_settings mapping is a deliberate approximation
of the same idea.
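As a rough sketch of a request body that carries delivery settings: the names stability, style, and speed come from the sliders described above, but the nesting under voice_settings, the numeric values, and the helper name tts_body are assumptions to check against the Create speech reference.

```python
import json


def tts_body(text, stability=None, style=None, speed=None):
    """Build a JSON body; voice_settings is left out unless a slider is set.

    Hypothetical helper. Field nesting and value ranges are assumptions.
    """
    body = {"text": text}
    sliders = (("stability", stability), ("style", style), ("speed", speed))
    settings = {name: value for name, value in sliders if value is not None}
    if settings:
        # Omitting the key entirely keeps the engine's default delivery.
        body["voice_settings"] = settings
    return json.dumps(body)


print(tts_body("Default delivery."))
# Illustrative values only; valid ranges are not specified in this section.
print(tts_body("A calmer, slower read.", stability=0.8, speed=0.9))
```

Leaving unset sliders out of the body, instead of sending nulls, keeps the request aligned with the "omit voice_settings" default described above.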
Choosing how it's delivered
- A finished clip → POST /v1/text-to-speech/{voice_id}: the whole audio in one response.
- Low latency → /stream: chunked audio, first bytes after the first sentence.
- Realtime → /stream-input: feed text as it's produced (e.g. from an LLM) and receive audio frames live.
- With alignment → .../with-timestamps: the clip plus an approximate character alignment (see Streaming).
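The choice can also be written down as a small decision function. This is an illustrative sketch only: the function and its flags are hypothetical, the order in which the flags are checked is an assumption, and the returned strings are just the path suffixes named in the list above (an empty suffix stands for the plain POST /v1/text-to-speech/{voice_id} call).

```python
def delivery_suffix(text_arrives_incrementally=False,
                    need_alignment=False,
                    low_latency=False):
    """Pick a delivery variant from what the caller needs.

    Hypothetical helper; the precedence of the checks is an assumption.
    """
    if text_arrives_incrementally:
        # Text is still being produced (e.g. by an LLM): realtime input.
        return "/stream-input"
    if need_alignment:
        # Clip plus approximate character alignment.
        return "/with-timestamps"
    if low_latency:
        # Chunked audio as it is generated.
        return "/stream"
    # Default: one finished clip in a single response.
    return ""


print(repr(delivery_suffix()))
print(repr(delivery_suffix(low_latency=True)))
```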