Open-weight text-to-speech model for expressive, high-quality speech generation
Top 7.5% on sourcepulse
Zonos-v0.1 is an open-weight text-to-speech model designed for highly natural and expressive speech generation, including zero-shot voice cloning. It targets researchers and developers seeking high-quality, controllable TTS capabilities, offering performance comparable to commercial providers.
How It Works
Zonos utilizes a transformer or hybrid backbone for DAC token prediction, preceded by text normalization and phonemization via eSpeak. This architecture allows for conditioning on speaker embeddings or audio prefixes, enabling fine-grained control over speech rate, pitch, audio quality, and emotions. The model outputs audio natively at 44kHz.
Quick Start & Requirements
uv sync
(or uv sync --extra compile
for hybrid) followed by uv pip install -e .
(or .[compile]
).Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
5 months ago
1 day