Zonos by Zyphra

Open-weight text-to-speech model for expressive, high-quality speech generation

created 5 months ago
6,874 stars

Top 7.5% on sourcepulse

Project Summary

Zonos-v0.1 is an open-weight text-to-speech model designed for highly natural and expressive speech generation, including zero-shot voice cloning. It targets researchers and developers who need high-quality, controllable TTS, and it claims performance comparable to commercial providers.

How It Works

Zonos first normalizes the input text and converts it to phonemes with eSpeak, then predicts DAC (Descript Audio Codec) tokens with either a transformer or a hybrid backbone. Generation can be conditioned on a speaker embedding or an audio prefix, and the model exposes fine-grained control over speaking rate, pitch, audio quality, and emotions. It outputs audio natively at 44 kHz.

Quick Start & Requirements

  • Install: run uv sync (or uv sync --extra compile for the hybrid model), then uv pip install -e . (or uv pip install -e .[compile]).
  • Prerequisites: Linux (Ubuntu 22.04/24.04 recommended) or macOS; a GPU with 6 GB+ VRAM (Nvidia 3000-series or newer for the hybrid model); the eSpeak-ng library.
  • Resources: CPU-only is possible but slow. Docker installation is available.
  • Demo: playground.zyphra.com/audio
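
The prerequisites above can be checked before installing. The following is a minimal preflight sketch, not part of Zonos itself: the function names are hypothetical, it assumes eSpeak-ng exposes an espeak-ng executable on PATH, it treats 6 GB as a soft minimum, and it only reports VRAM if PyTorch with CUDA happens to be installed.

```python
import platform
import shutil

MIN_VRAM_GB = 6  # "6 GB+ VRAM" from the prerequisites list


def find_problems(system, has_espeak, vram_gb):
    """Return a human-readable note for each prerequisite that looks unmet."""
    notes = []
    if system not in ("Linux", "Darwin"):  # platform.system() reports macOS as "Darwin"
        notes.append(f"unsupported OS: {system}")
    if not has_espeak:
        notes.append("espeak-ng not found on PATH")
    if vram_gb is None:
        notes.append("no CUDA GPU detected; CPU-only is possible but slow")
    elif vram_gb < MIN_VRAM_GB:
        notes.append(f"GPU has {vram_gb:.1f} GB VRAM; {MIN_VRAM_GB} GB+ is recommended")
    return notes


def detect_vram_gb():
    """Total memory of CUDA device 0 in GiB, or None if it cannot be determined."""
    try:
        import torch  # optional: only used to query the GPU
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


if __name__ == "__main__":
    notes = find_problems(
        platform.system(),
        shutil.which("espeak-ng") is not None,  # assumption: CLI name is espeak-ng
        detect_vram_gb(),
    )
    print("\n".join(notes) if notes else "all listed prerequisites look satisfied")
```

Keeping the decision logic in find_problems, separate from the probing code, makes the check easy to exercise without a GPU.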

Highlighted Details

  • Zero-shot TTS with voice cloning from short audio samples.
  • Supports multilingual generation (English, Japanese, Chinese, French, German).
  • Fine-grained control over speaking rate, pitch, audio quality, and emotions (happiness, fear, sadness, anger).
  • Real-time factor of ~2x on an RTX 4090.
  • Includes a Gradio WebUI for easy speech generation.
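
To make the real-time-factor figure concrete, here is a small sketch of how such a number could be measured. It assumes "~2x" means roughly two seconds of audio produced per second of compute (some sources define the ratio the other way round), and the generate callable is a hypothetical stand-in for whatever produces the audio samples.

```python
import time


def real_time_factor(audio_seconds, wall_seconds):
    """Seconds of audio produced per second of compute (higher is faster)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def measure(generate, sample_rate):
    """Time one call to generate() and return its real-time factor.

    generate is a hypothetical zero-argument callable returning a sequence
    of mono audio samples at sample_rate Hz.
    """
    start = time.perf_counter()
    samples = generate()
    elapsed = time.perf_counter() - start
    return real_time_factor(len(samples) / sample_rate, elapsed)


# Example: 10 s of audio generated in 5 s of wall-clock time.
print(real_time_factor(10.0, 5.0))  # prints 2.0
```

Under this convention, a factor above 1 means speech is generated faster than it plays back, which is what matters for streaming use.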

Maintenance & Community

  • No specific contributors, sponsorships, or community links (Discord/Slack, roadmap) are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

  • Experimental Windows support is available via a fork.
  • The hybrid model has specific GPU requirements.
  • The README does not detail any known bugs or deprecations.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 5

Star History

  • 406 stars in the last 90 days
