Zonos by Zyphra

Open-weight text-to-speech model for expressive, high-quality speech generation

Created 1 year ago

7,162 stars

Top 7.2% on SourcePulse

2 Experts Love This Project

luiscape (Cofounder of Lightning AI)

didierrlopes (Founder of OpenBB)

Project Summary

Zonos-v0.1 is an open-weight text-to-speech model for natural, expressive speech generation, including zero-shot voice cloning. It targets researchers and developers who need high-quality, controllable TTS, and its authors describe its output quality as comparable to that of commercial TTS providers.

How It Works

Zonos first normalizes the input text and converts it to phonemes with eSpeak, then predicts DAC (Descript Audio Codec) tokens with either a transformer or a hybrid backbone. Generation can be conditioned on a speaker embedding or an audio prefix, and additional conditioning inputs give fine-grained control over speaking rate, pitch, audio quality, and emotion. The model outputs audio natively at 44 kHz.

Quick Start & Requirements

Install: run uv sync (or uv sync --extra compile for the hybrid model), then uv pip install -e . (or uv pip install -e .[compile]).
Prerequisites: Linux (Ubuntu 22.04/24.04 recommended) or macOS; a GPU with at least 6 GB of VRAM (the hybrid model needs an Nvidia 3000-series or newer card); the eSpeak-ng library.
Resources: CPU-only is possible but slow. Docker installation is available.
Demo: playground.zyphra.com/audio
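
Once the package is installed, voice cloning from a reference clip might look like the sketch below. It is modeled on the usage example in the project README as best recalled, so treat every Zonos import and method name (Zonos.from_pretrained, make_speaker_embedding, make_cond_dict, prepare_conditioning, generate, autoencoder.decode) as an assumption to check against the installed version. The helper names pick_checkpoint and synthesize, and the file paths, are hypothetical and exist only for illustration.

```python
def pick_checkpoint(use_hybrid: bool = False) -> str:
    """Return the Hugging Face checkpoint name for the chosen backbone."""
    if use_hybrid:
        return "Zyphra/Zonos-v0.1-hybrid"
    return "Zyphra/Zonos-v0.1-transformer"


def synthesize(text: str, reference_audio: str, out_path: str = "sample.wav",
               language: str = "en-us", use_hybrid: bool = False) -> str:
    """Clone the voice in reference_audio and speak text (hypothetical helper)."""
    # Heavy dependencies are imported lazily so this module loads without them.
    import torchaudio
    from zonos.model import Zonos                      # assumed module path
    from zonos.conditioning import make_cond_dict      # assumed module path
    from zonos.utils import DEFAULT_DEVICE as device   # assumed module path

    model = Zonos.from_pretrained(pick_checkpoint(use_hybrid), device=device)

    # Build a speaker embedding from a short reference recording.
    wav, sampling_rate = torchaudio.load(reference_audio)
    speaker = model.make_speaker_embedding(wav, sampling_rate)

    # Bundle text, speaker, and language into the conditioning input.
    cond_dict = make_cond_dict(text=text, speaker=speaker, language=language)
    conditioning = model.prepare_conditioning(cond_dict)

    # Predict DAC tokens, then decode them back to a waveform.
    codes = model.generate(conditioning)
    wavs = model.autoencoder.decode(codes).cpu()
    torchaudio.save(out_path, wavs[0], model.autoencoder.sampling_rate)
    return out_path
```

A call such as synthesize("Hello, world!", "reference.mp3") would then write sample.wav, assuming the API matches and a suitable GPU is available.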

Highlighted Details

Zero-shot TTS with voice cloning from short audio samples.
Supports multilingual generation (English, Japanese, Chinese, French, German).
Fine-grained control over speaking rate, pitch, audio quality, and emotions (happiness, fear, sadness, anger).
Real-time factor of ~2x on an RTX 4090.
Includes a Gradio WebUI for easy speech generation.
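
To put the throughput and sample-rate figures above in concrete terms, here is a back-of-the-envelope sketch. It assumes that "~2x" means audio is generated about twice as fast as it plays back, and that the nominal 44 kHz output corresponds to 44,100 Hz; both are assumptions, and the function names are hypothetical.

```python
def generation_seconds(audio_seconds: float, speedup: float = 2.0) -> float:
    """Estimate wall-clock time to generate a clip at a given real-time speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def sample_count(audio_seconds: float, sample_rate_hz: int = 44_100) -> int:
    """Number of audio samples in a clip at the given sample rate."""
    return round(audio_seconds * sample_rate_hz)


if __name__ == "__main__":
    clip = 30.0  # seconds of speech to generate
    print(f"Estimated generation time: {generation_seconds(clip):.1f} s")
    print(f"Samples in output: {sample_count(clip)}")
```

Under these assumptions, a 30-second clip would take roughly 15 seconds to generate on an RTX 4090. Actual throughput depends on the backbone, text length, and hardware.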

Maintenance & Community

No specific contributors, sponsorships, or community links (Discord/Slack, roadmap) are mentioned in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

Experimental Windows support is available via a fork. The hybrid model has specific GPU requirements. The README does not detail any known bugs or deprecations.

Health Check

Last Commit: 11 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 42 stars in the last 30 days

Explore Similar Projects

praises by ElmTran

Text-to-speech tool for easy reading

Created 1 year ago

Updated 7 months ago

Meta-voicebox by SpeechifyInc

PyTorch implementation of Meta's Voicebox speech model

Created 2 years ago

Updated 2 years ago

SpeechGPT-2.0-preview by OpenMOSS

Real-time spoken dialogue system with GPT-4o-level capabilities

Created 1 year ago

Updated 1 year ago

FireRedTTS by FireRedTeam

LLM-empowered TTS system for research

Created 1 year ago

Updated 5 months ago

fast-voice-assistant by dsa

AI voice assistant demo with <500ms response

Created 1 year ago

Updated 1 year ago

FireRedTTS2 by FireRedTeam

Streaming TTS for natural, long-form dialogue

Created 5 months ago

Updated 4 months ago

Starred by Jeremy Howard (Cofounder of fast.ai), Alex Chen (Cofounder of Nexa AI), and 1 more.

LLaMA-Omni by ictnlp

Speech-language model for low-latency, high-quality speech interaction

Created 1 year ago

Updated 9 months ago

Starred by Luis Capelo (Cofounder of Lightning AI).

Step-Audio by stepfun-ai

Speech interaction framework for multilingual conversation and controllable speech synthesis

Created 1 year ago

Updated 1 week ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Michael Han (Cofounder of Unsloth), and 1 more.

Orpheus-TTS by canopyai

Open-source TTS for human-sounding speech, built on Llama-3b

Created 11 months ago

Updated 2 months ago

Starred by Jeremy Howard (Cofounder of fast.ai), Bryan Helmig (Cofounder of Zapier), and 3 more.

KittenTTS by KittenML

Realistic text-to-speech model under 25MB

Created 6 months ago

Updated 1 day ago

Starred by Jiaming Song (Chief Scientist at Luma AI).

Qwen3-TTS by QwenLM

Powerful speech generation models for diverse applications

Created 1 month ago

Updated 2 weeks ago

CosyVoice by FunAudioLLM

Voice generation model for inference, training, and deployment

Created 1 year ago

Updated 2 weeks ago
