Voice settings
The ZeliSpeech voice_settings sliders are accepted and mapped onto
the Zeli Turbo delivery knobs. Send the same body your integration already
sends — it's translated, not rejected.
The mapping
| voice_settings | Range | Maps to | Effect |
|---|---|---|---|
| style | 0–1 | exaggeration | expressiveness (0 ≈ neutral, 1 ≈ animated) |
| stability | 0–1 | cfg_weight ↑ + temperature ↓ | steadier / more monotone as it rises |
| speed | ~0.7–1.2 | speed (WSOLA) | pitch-preserving pace |
| similarity_boost | 0–1 | (no-op) | timbre is fixed by the reference clip |
| use_speaker_boost | bool | (no-op) | timbre is fixed by the reference clip |
This mapping is an intentional approximation — the voice-settings sliders and
the Turbo knobs aren't the same axes. It's tuned to feel familiar, not
identical. Omit voice_settings entirely to get the engine's
lively-neutral default.
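To make the table concrete, here is a minimal sketch of what such a translation could look like. Only the direction of each mapping comes from the table; the function name `map_voice_settings`, the linear formulas, the numeric output ranges, the fallback values, and the clamping of `speed` are all assumptions made for illustration, not the engine's actual curve.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurboKnobs:
    exaggeration: float
    cfg_weight: float
    temperature: float
    speed: float


def _clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def _lerp(start: float, end: float, t: float) -> float:
    """Linear interpolation from start (t=0) to end (t=1)."""
    return start + (end - start) * t


def map_voice_settings(settings: Optional[dict] = None) -> TurboKnobs:
    """Translate a voice_settings dict into Turbo knobs (illustrative only)."""
    settings = settings or {}
    # Fallback values here are hypothetical, not the engine's real defaults.
    style = _clamp(float(settings.get("style", 0.5)), 0.0, 1.0)
    stability = _clamp(float(settings.get("stability", 0.5)), 0.0, 1.0)
    speed = _clamp(float(settings.get("speed", 1.0)), 0.7, 1.2)
    # similarity_boost and use_speaker_boost are accepted but never read:
    # timbre is fixed by the reference clip.
    return TurboKnobs(
        exaggeration=style,                      # style -> exaggeration
        cfg_weight=_lerp(0.3, 0.7, stability),   # rises with stability (assumed range)
        temperature=_lerp(1.0, 0.6, stability),  # falls with stability (assumed range)
        speed=speed,
    )
```

Whatever the real curve is, the property a client can rely on is the one in the table: raising `stability` pushes `cfg_weight` up and `temperature` down, and the two boost fields change nothing.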
Sending voice settings
```python
from zeli_tts import ZeliSpeech, VoiceSettings

client = ZeliSpeech(api_key="sk-zeli-...", base_url="https://voice.your-domain.com")

audio = client.text_to_speech.convert(
    voice_id="zeli-voice-1",
    text="Steadier and a little more expressive.",
    voice_settings=VoiceSettings(stability=0.6, style=0.35, speed=1.0),
)
```

```bash
curl -X POST "https://voice.your-domain.com/v1/text-to-speech/zeli-voice-1" \
  -H "Authorization: Bearer sk-zeli-..." -H "Content-Type: application/json" \
  -d '{
    "text": "Steadier and a little more expressive.",
    "voice_settings": { "stability": 0.6, "style": 0.35, "speed": 1.0 }
  }' --output out.mp3
```

Reading defaults
Two endpoints return voice-settings shapes for a UI:
Both require authentication when an API key is configured, and both return the ZeliSpeech voice-settings shape, so a client that reads defaults before rendering sliders works unchanged.
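A client that reads defaults before rendering sliders typically layers any saved user values on top of them. The sketch below shows that merge as a pure function. The helper name `initial_slider_values` and the flat dictionary shape of the defaults response are assumptions; only the field names come from the mapping table.

```python
from typing import Any, Dict, Optional

# Field names taken from the voice_settings mapping table.
SLIDER_KEYS = ("stability", "style", "speed", "similarity_boost", "use_speaker_boost")


def initial_slider_values(
    defaults: Dict[str, Any],
    saved: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Start from server defaults, then apply saved user values on top."""
    # Keep only the fields the UI knows how to render.
    values = {key: defaults[key] for key in SLIDER_KEYS if key in defaults}
    for key, value in (saved or {}).items():
        # Unknown keys and unset (None) values never override a default.
        if key in SLIDER_KEYS and value is not None:
            values[key] = value
    return values
```

The result can be passed as the `voice_settings` body unchanged, since it uses the same field names the API accepts.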
The underlying knobs
If you use the native Zeli contract directly (or the Zeli SDK), you can set the Turbo knobs without going through the voice_settings mapping:

- exaggeration: Expressiveness. Higher is more animated.
- cfg_weight: Guidance weight. Higher is steadier / more faithful, lower is more dynamic.
- temperature: Sampling randomness. Lower is more monotone and stable.
- speed: Pitch-preserving time-stretch (WSOLA).
A per-emotion library on the native contract maps named tones to all four knobs
at once, so presets like excited, calm, and serious
sound distinct. The voice_settings mapping above is a convenient
shorthand for the same delivery system.
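As a rough picture of what a per-emotion library amounts to, the sketch below maps each named tone to a full set of the four knobs and lets a caller override individual values. The numbers are placeholders chosen so the presets differ, not the engine's real preset values, and `EMOTION_PRESETS` and `knobs_for` are hypothetical names that do not come from the Zeli SDK.

```python
from typing import Dict, NamedTuple


class Knobs(NamedTuple):
    exaggeration: float
    cfg_weight: float
    temperature: float
    speed: float


# Placeholder values for illustration only; the real library's numbers
# are not published on this page.
EMOTION_PRESETS: Dict[str, Knobs] = {
    "excited": Knobs(exaggeration=0.9, cfg_weight=0.3, temperature=0.9, speed=1.1),
    "calm": Knobs(exaggeration=0.3, cfg_weight=0.6, temperature=0.6, speed=0.95),
    "serious": Knobs(exaggeration=0.2, cfg_weight=0.7, temperature=0.5, speed=1.0),
}


def knobs_for(emotion: str, **overrides: float) -> Knobs:
    """Look up a preset and optionally override individual knobs."""
    try:
        base = EMOTION_PRESETS[emotion]
    except KeyError:
        raise ValueError(f"unknown emotion preset: {emotion!r}") from None
    # _replace returns a new tuple; the shared preset is never mutated.
    return base._replace(**overrides)
```

For example, `knobs_for("calm", speed=1.1)` keeps the calm preset's other three values and changes only the pace.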