Zeligate VoiceDeveloper docs
v1.3.0Sign in

Capabilities / Text to speech

Text to speech

The core capability: send text, get natural, expressive speech back. One request, one clip — or stream it as it generates. The engine renders 24 kHz mono audio and transcodes to whatever output format you ask for.

The request

Every synthesis call is a POST to /v1/text-to-speech/{voice_id} with a JSON body. The only required field is text.

from zeli_tts import ZeliSpeech
 
client = ZeliSpeech(api_key="sk-zeli-...", base_url="https://voice.your-domain.com")
 
audio = client.text_to_speech.convert(
    voice_id="zeli-voice-3",
    text="This whole passage is spoken as one natural take, "
         "so the pacing and intonation flow from start to finish.",
    output_format="mp3_44100_128",
)
with open("out.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)

See the full Create speech reference for every field and response detail.

Models

model_id is accepted and ignored — every request runs on the one Zeli Turbo engine. The /v1/models list exists only so that a client with a hard-coded model_id (like zeli-turbo) resolves cleanly. You never have to change your model string when migrating.

One engine, many aliases

Send whatever model_id your integration already uses. It's recorded and echoed but doesn't change the engine. See Models.

Writing text that sounds human

The engine speaks at its native rate; punctuation and structure steer pacing and intonation. A few reliable levers:

You writeYou get
,a short breath / beat
.a full stop, falling tone
...hesitation, trailing off
(em dash)an abrupt cut or aside
? / !a rising / energetic tone
blank linea paragraph break — spoken as its own take with a short pause

Tips:

  • Write the way people talk: contractions (it's, you're), short sentences.
  • Sprinkle real fillers — well, honestly, you know, — for a conversational read.
  • Separate distinct points or lines of dialogue with a blank line; each is spoken as its own take.

Expressiveness

Steer the delivery with voice settings — the ZeliSpeech stability / style / speed voice-settings sliders map onto the Turbo delivery knobs (cfg_weight, temperature, exaggeration, and a WSOLA time-stretch for pace). Omit voice_settings to get the engine's lively-neutral default.

Emotion presets

On the native /tts contract, a per-emotion library maps tones (calm, excited, serious, sad…) to all four delivery knobs at once, so emotions sound genuinely distinct. The voice_settings mapping is a deliberate approximation of the same idea.

Choosing how it's delivered

  • A finished clipPOST /v1/text-to-speech/{voice_id} — the whole audio in one response.
  • Low latency/stream — chunked audio, first bytes after the first sentence.
  • Realtime/stream-input — feed text as it's produced (e.g. from an LLM) and receive audio frames live.
  • With alignment.../with-timestamps — the clip plus an approximate character alignment (see Streaming).
Zeligate Voice API · self-hosted · secure data sovereignty · source