Zonos by Zyphra

Open-weight text-to-speech model for expressive, high-quality speech generation

Created 11 months ago
7,147 stars

Top 7.2% on SourcePulse

Project Summary

Zonos-v0.1 is an open-weight text-to-speech model for natural, expressive speech generation, including zero-shot voice cloning. It targets researchers and developers who need high-quality, controllable TTS, and Zyphra positions its output quality as comparable to that of commercial TTS providers.

How It Works

Zonos first normalizes the input text and converts it to phonemes with eSpeak, then predicts DAC (Descript Audio Codec) tokens with either a transformer or a hybrid backbone. Generation is conditioned on a speaker embedding or an audio prefix, and additional conditioning inputs give fine-grained control over speaking rate, pitch, audio quality, and emotion. The model outputs audio natively at 44 kHz.
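
The flow described above (speaker embedding, conditioning, token generation, decoding) can be sketched in Python. This is a minimal sketch, not a definitive implementation: the module paths, method names, and the checkpoint name "Zyphra/Zonos-v0.1-transformer" are recalled from the project's README and should be checked against the installed version. The wrapper function synthesize and its parameters are hypothetical names introduced here for illustration.

```python
def synthesize(text, reference_audio_path, output_path="sample.wav", device="cuda"):
    """Speak `text` in the voice of the reference clip (illustrative sketch)."""
    # Imported lazily so this file can be loaded without torchaudio or zonos
    # installed; both are required when the function is actually called.
    import torchaudio
    from zonos.model import Zonos
    from zonos.conditioning import make_cond_dict

    # Assumed checkpoint name, recalled from the README; verify before use.
    model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device=device)

    # Build a speaker embedding from a short reference recording.
    wav, sampling_rate = torchaudio.load(reference_audio_path)
    speaker = model.make_speaker_embedding(wav, sampling_rate)

    # Bundle text, speaker, and language into the model's conditioning input.
    cond_dict = make_cond_dict(text=text, speaker=speaker, language="en-us")
    conditioning = model.prepare_conditioning(cond_dict)

    # Predict audio codec tokens, then decode them back to a waveform.
    codes = model.generate(conditioning)
    wavs = model.autoencoder.decode(codes).cpu()

    # Save at the autoencoder's native sampling rate.
    torchaudio.save(output_path, wavs[0], model.autoencoder.sampling_rate)
    return output_path
```

A call such as synthesize("Hello, world!", "reference.wav") would then write sample.wav, assuming a GPU is available and reference.wav (a placeholder path) holds a short clip of the target speaker.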

Quick Start & Requirements

  • Install: uv sync (or uv sync --extra compile for the hybrid model), followed by uv pip install -e . (or -e .[compile]).
  • Prerequisites: Linux (Ubuntu 22.04/24.04 recommended) or macOS; a GPU with 6 GB+ VRAM (the hybrid model additionally requires an Nvidia 3000-series or newer GPU); the eSpeak NG library for phonemization.
  • Resources: CPU-only inference is possible but much slower than on a GPU. A Docker-based installation is available.
  • Demo: playground.zyphra.com/audio
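
Before installing, the operating-system and eSpeak prerequisites listed above can be checked with a few lines of standard-library Python. This is a rough sketch under stated assumptions: the function name preflight is hypothetical, and looking for an espeak-ng executable on the PATH is only a proxy for the library being installed. GPU memory is not checked here because that would require a GPU library such as PyTorch.

```python
import platform
import shutil

# platform.system() reports macOS as "Darwin".
SUPPORTED_SYSTEMS = {"Linux", "Darwin"}


def preflight(system=None, which=shutil.which):
    """Return basic prerequisite checks as a dict of booleans.

    `system` and `which` can be overridden, which makes the function easy
    to exercise without touching the real machine.
    """
    if system is None:
        system = platform.system()
    return {
        "os_supported": system in SUPPORTED_SYSTEMS,
        # Assumption: an `espeak-ng` binary on PATH implies eSpeak NG is installed.
        "espeak_ng_found": which("espeak-ng") is not None,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If espeak_ng_found comes back false, the library may still be present without its command-line tool, so treat the result as a hint and not a verdict.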

Highlighted Details

  • Zero-shot TTS with voice cloning from short audio samples.
  • Supports multilingual generation (English, Japanese, Chinese, French, German).
  • Fine-grained control over speaking rate, pitch, audio quality, and emotions (happiness, fear, sadness, anger).
  • Real-time factor of ~2x on an RTX 4090 (roughly 2 seconds of audio generated per second of compute).
  • Includes a Gradio WebUI for easy speech generation.
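
To make the real-time factor concrete, the arithmetic can be written out as a short helper. It assumes the factor is defined as seconds of audio produced per second of compute, so a factor of 2 halves the wall-clock time relative to the audio length. The function name is hypothetical, and the 2.0 default reflects the RTX 4090 figure quoted above; other hardware will differ.

```python
def estimated_compute_seconds(audio_seconds: float, realtime_factor: float = 2.0) -> float:
    """Estimate compute time needed to generate `audio_seconds` of speech.

    `realtime_factor` is seconds of audio generated per second of compute.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# One minute of speech at a real-time factor of 2 takes about 30 seconds.
print(estimated_compute_seconds(60.0))  # 30.0
```

This is only a planning estimate: it ignores model load time and any variation in throughput with text length or batch size.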

Maintenance & Community

  • The README does not mention specific contributors, sponsorships, community channels (Discord/Slack), or a roadmap.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

  • Windows support is experimental and available only via a fork.
  • The hybrid model requires an Nvidia 3000-series or newer GPU.
  • The README does not detail any known bugs or deprecations.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 27 stars in the last 30 days
