ChatTTS by 2noise

Generative speech model for daily dialogue

Created 1 year ago

38,760 stars

Top 0.8% on SourcePulse

5 Experts Love This Project

Omar Sanseviero

DevRel at Google DeepMind

Li Jiang

Coauthor of AutoGen; Engineer at Microsoft

Project Summary

ChatTTS is a generative speech model optimized for dialogue scenarios, targeting LLM assistants and conversational AI applications. It provides natural, expressive speech synthesis with fine-grained control over prosody, including laughter and pauses, aiming to surpass existing open-source TTS models in conversational quality.

How It Works

ChatTTS is built for dialogue-centric speech synthesis rather than general-purpose narration. It supports multiple speakers and gives fine-grained control over prosodic features such as laughter, pauses, and interjections. The model is trained on a large corpus of Chinese and English audio, and the project claims better prosody than other open-source TTS solutions.
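The prosody controls described above are exposed, according to the upstream README, as inline text tokens and a sentence-level refinement prompt. The following is a minimal sketch that only builds those strings. It assumes the token spellings `[uv_break]`, `[laugh]`, and `[lbreak]`, the prompt form `[oral_2][laugh_0][break_6]`, and the value ranges shown in that README; the helper names `build_refine_prompt` and `with_break_tokens` are hypothetical and not part of ChatTTS.

```python
# Hypothetical helpers: they only build strings and never call ChatTTS.
# Token spellings and value ranges are assumptions taken from the upstream README.

ORAL_RANGE = range(0, 10)   # oral_0 .. oral_9
LAUGH_RANGE = range(0, 3)   # laugh_0 .. laugh_2
BREAK_RANGE = range(0, 8)   # break_0 .. break_7


def build_refine_prompt(oral: int, laugh: int, pause: int) -> str:
    """Return a sentence-level control prompt such as '[oral_2][laugh_0][break_6]'."""
    if oral not in ORAL_RANGE:
        raise ValueError(f"oral must be in 0..9, got {oral}")
    if laugh not in LAUGH_RANGE:
        raise ValueError(f"laugh must be in 0..2, got {laugh}")
    if pause not in BREAK_RANGE:
        raise ValueError(f"pause must be in 0..7, got {pause}")
    return f"[oral_{oral}][laugh_{laugh}][break_{pause}]"


def with_break_tokens(phrases: list[str], laugh_at_end: bool = False) -> str:
    """Join phrases with a word-level pause token, optionally ending with a laugh."""
    text = " [uv_break] ".join(p.strip() for p in phrases)
    if laugh_at_end:
        text += " [laugh]"
    # Close the utterance with a line-break token.
    return text + " [lbreak]"


if __name__ == "__main__":
    print(build_refine_prompt(2, 0, 6))
    print(with_break_tokens(["What is", "your favorite English food?"], laugh_at_end=True))
```

The resulting strings would then be passed to the model as input text or as a refinement prompt; how they are consumed depends on the installed ChatTTS version.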

Quick Start & Requirements

Install: pip install ChatTTS, or pip install -e . from a cloned repository for local development.
Prerequisites: Python 3.11+, PyTorch, torchaudio. Optional: vLLM (Linux), TransformerEngine (Linux), FlashAttention-2 (NVIDIA GPU).
Resources: At least 4 GB of GPU memory is required to generate a 30-second audio clip. On an NVIDIA RTX 4090, inference runs at approximately 7 semantic tokens per second, a real-time factor (RTF) of about 0.3.
Links: Hugging Face models, Colab example
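A basic inference run, following the usage pattern in the upstream README, might look like the sketch below. It assumes that `ChatTTS.Chat().load(compile=False)` and `chat.infer(texts)` behave as documented there, that `infer` returns one NumPy waveform per input text, and that the output sample rate is 24 kHz. The function names `synthesize_to_wav` and `estimated_synthesis_seconds` are hypothetical. The heavy imports are deferred into the function so the file can be loaded even where ChatTTS is not installed.

```python
def estimated_synthesis_seconds(audio_seconds: float, rtf: float = 0.3) -> float:
    """Estimate compute time from the real-time factor (RTF = compute time / audio time)."""
    return audio_seconds * rtf


def synthesize_to_wav(texts: list[str], prefix: str = "output",
                      sample_rate: int = 24000) -> list[str]:
    """Synthesize each text and write it to '<prefix><index>.wav'; return the paths."""
    # Deferred imports: these packages are only needed when synthesis actually runs.
    import torch
    import torchaudio
    import ChatTTS

    chat = ChatTTS.Chat()
    chat.load(compile=False)  # assumption: compile=False trades speed for a faster startup
    wavs = chat.infer(texts)  # assumption: one NumPy array per input text

    paths = []
    for i, wav in enumerate(wavs):
        tensor = torch.from_numpy(wav)
        if tensor.dim() == 1:
            # torchaudio.save expects a 2-D (channels, frames) tensor.
            tensor = tensor.unsqueeze(0)
        path = f"{prefix}{i}.wav"
        torchaudio.save(path, tensor, sample_rate)
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Rough time estimate for a 30-second clip at the quoted RTF of about 0.3.
    print(estimated_synthesis_seconds(30.0))
```

With the dependencies installed, `synthesize_to_wav(["Hello from ChatTTS."])` would be the entry point. The RTF helper simply multiplies clip length by the quoted factor, so it is a rough estimate rather than a benchmark.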

Highlighted Details

Optimized for conversational AI and LLM assistants.
Supports fine-grained control over prosody (laughter, pauses, interjections).
Claims better prosody than most open-source TTS models.
Trained on 100,000+ hours of Chinese and English audio data.

Maintenance & Community

Active community support via Discord.
Roadmap includes multi-emotion control and a potential C++ implementation.
Contact: open-source@2noise.com

Licensing & Compatibility

Code License: AGPLv3+
Model License: CC BY-NC 4.0 (Non-commercial, educational, and research use only).

Limitations & Caveats

The released model is intended for academic and research purposes only and may not be used commercially. To deter malicious use, the authors intentionally added noise and reduced the audio quality through compression, so output fidelity is deliberately limited. English synthesis is described as experimental.

Health Check

Last Commit

1 month ago

Responsiveness

1 day

Star History

231 stars in the last 30 days