Text to speech

The core capability: send text, get natural, expressive speech back. One request, one clip — or stream it as it generates. The engine renders 24 kHz mono audio and transcodes to whatever output format you ask for.

The request

Every synthesis call is a POST to /v1/text-to-speech/{voice_id} with a JSON body. The only required field is text.

from zeli_tts import ZeliSpeech

# Point the client at your own deployment.
client = ZeliSpeech(api_key="sk-zeli-...", base_url="https://voice.your-domain.com")

# text is the only required field; output_format selects the transcode target.
audio = client.text_to_speech.convert(
    voice_id="zeli-voice-3",
    text="This whole passage is spoken as one natural take, "
         "so the pacing and intonation flow from start to finish.",
    output_format="mp3_44100_128",
)
# The result is iterated as byte chunks; write them out in order.
with open("out.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)

curl -X POST \
  "https://voice.your-domain.com/v1/text-to-speech/zeli-voice-3?output_format=mp3_44100_128" \
  -H "Authorization: Bearer sk-zeli-..." -H "Content-Type: application/json" \
  -d '{"text":"A finished clip in one call."}' --output out.mp3

See the full Create speech reference for every field and response detail.

Models

model_id is accepted and ignored — every request runs on the one Zeli Turbo engine. The /v1/models list exists only so that a client with a hard-coded model_id (like zeli-turbo) resolves cleanly. You never have to change your model string when migrating.

One engine, many aliases

Send whatever model_id your integration already uses. It's recorded and echoed but doesn't change the engine. See Models.

Writing text that sounds human

The engine speaks at its native rate; punctuation and structure steer pacing and intonation. A few reliable levers:

You write	You get
`,`	a short breath / beat
`.`	a full stop, falling tone
`...`	hesitation, trailing off
`—` (em dash)	an abrupt cut or aside
`?` / `!`	a rising / energetic tone
blank line	a paragraph break — spoken as its own take with a short pause

Tips:

Write the way people talk: contractions (it's, you're), short sentences.
Sprinkle in real fillers (well, honestly, you know) for a conversational read.
Separate distinct points or lines of dialogue with a blank line; each is spoken as its own take.
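
To preview how a script will break into takes before sending it, the blank-line rule can be approximated client-side. This is a minimal sketch, assuming the engine treats one or more blank lines as a paragraph boundary; the exact server-side splitting is not specified here, and split_takes is a hypothetical helper, not part of the SDK.

```python
import re


def split_takes(text: str) -> list[str]:
    """Approximate the paragraph-per-take rule (hypothetical helper, not SDK code)."""
    # One or more blank (whitespace-only) lines separate paragraphs.
    parts = re.split(r"\n\s*\n", text.strip())
    # Collapse single line wraps so each take reads as one passage.
    return [" ".join(p.split()) for p in parts if p.strip()]


script = "Well, honestly... I didn't expect that.\n\nAnyway, here's the plan."
for number, take in enumerate(split_takes(script), start=1):
    print(number, take)
```

If a passage should flow as one take, keep it in a single paragraph; a stray blank line would split it in this approximation.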

Expressiveness

Steer the delivery with voice_settings: the stability, style, and speed sliders map onto the Turbo delivery knobs (cfg_weight, temperature, exaggeration, and a WSOLA time-stretch for pace). Omit voice_settings to get the engine's lively-neutral default.
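
As an illustration, a request body with and without voice settings might be assembled as below. This is a sketch under assumptions: the nested key names (stability, style, speed) are taken from the slider names above, the numeric values are placeholders because this page does not state valid ranges, and build_body is a hypothetical helper rather than an SDK function.

```python
import json

# Slider names as listed above; their use as JSON keys is an assumption.
ALLOWED_SETTINGS = {"stability", "style", "speed"}


def build_body(text: str, **settings: float) -> dict:
    """Build a request body, leaving voice_settings out when no slider is set."""
    unknown = set(settings) - ALLOWED_SETTINGS
    if unknown:
        raise ValueError(f"unknown voice setting(s): {sorted(unknown)}")
    body = {"text": text}
    if settings:
        # Only the sliders the caller chose are included.
        body["voice_settings"] = dict(settings)
    return body


# No sliders: voice_settings is omitted, so the engine default applies.
print(json.dumps(build_body("Hello there.")))
# Placeholder values; check the Create speech reference for real ranges.
print(json.dumps(build_body("Hello there.", stability=0.4, speed=1.1)))
```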

Emotion presets

On the native /tts contract, a per-emotion library maps tones (calm, excited, serious, sad…) to all four delivery knobs at once, so emotions sound genuinely distinct. The voice_settings mapping is a deliberate approximation of the same idea.

Choosing how it's delivered

A finished clip → POST /v1/text-to-speech/{voice_id} — the whole audio in one response.
Low latency → /stream — chunked audio, first bytes after the first sentence.
Realtime → /stream-input — feed text as it's produced (e.g. from an LLM) and receive audio frames live.
With alignment → .../with-timestamps — the clip plus an approximate character alignment (see Streaming).
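
The four options above can be captured in a small lookup. This is a sketch, assuming each suffix is appended to the base /v1/text-to-speech/{voice_id} path; the mode names and endpoint_path are hypothetical, and the transport used by /stream-input is not specified on this page.

```python
BASE_PATH = "/v1/text-to-speech/{voice_id}"

# Suffixes as named in the list above; appending them to the base path is an assumption.
SUFFIXES = {
    "clip": "",                        # whole audio in one response
    "stream": "/stream",               # chunked audio
    "realtime": "/stream-input",       # feed text as it is produced
    "timestamps": "/with-timestamps",  # clip plus approximate alignment
}


def endpoint_path(voice_id: str, mode: str = "clip") -> str:
    """Return the request path for a delivery mode (hypothetical helper)."""
    if mode not in SUFFIXES:
        raise ValueError(f"unknown delivery mode: {mode!r}")
    return BASE_PATH.format(voice_id=voice_id) + SUFFIXES[mode]


for mode in SUFFIXES:
    print(mode, endpoint_path("zeli-voice-3", mode))
```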
