StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis

created 2 years ago
5,869 stars

Top 8.9% on sourcepulse

View on GitHub
Project Summary

StyleTTS 2 is a text-to-speech (TTS) system designed to achieve human-level speech synthesis through a novel approach combining style diffusion and adversarial training with large speech language models (SLMs). It targets researchers and developers seeking state-of-the-art TTS capabilities, offering improved naturalness and zero-shot speaker adaptation.

How It Works

StyleTTS 2 models speech styles as latent variables using diffusion models, enabling the generation of appropriate styles for text without requiring reference audio. This latent diffusion approach leverages the diversity of diffusion models for efficient and high-quality synthesis. The system further enhances speech naturalness by employing large pre-trained SLMs (like WavLM) as discriminators and incorporating a novel differentiable duration modeling technique for end-to-end training.
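
To make the style-diffusion idea concrete, below is a minimal sketch of ancestral DDPM sampling in a style latent space, assuming a hypothetical `style_denoiser` network that predicts noise from the noisy style vector, the timestep, and a text embedding. This illustrates the general technique only; it is not the project's actual sampler or noise schedule.

```python
import torch

def sample_style(style_denoiser, text_emb, style_dim=128, steps=50, device="cpu"):
    # Linear beta schedule for illustration; the real model uses its own schedule.
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure Gaussian noise in the style latent space.
    x = torch.randn(1, style_dim, device=device)
    for t in reversed(range(steps)):
        # Hypothetical denoiser: predicts the noise component given the text embedding.
        eps = style_denoiser(x, torch.tensor([t], device=device), text_emb)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # a style vector sampled for the text, with no reference audio required
```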

Quick Start & Requirements

  • Install: pip install -r requirements.txt; on Windows with CUDA, additionally install matching torch, torchvision, and torchaudio builds. The demo also needs phonemizer and espeak-ng (see the front-end sketch after this list).
  • Prerequisites: Python >= 3.7, CUDA >= 11.8 (recommended for Windows), espeak-ng.
  • Data: LJSpeech or LibriTTS datasets required.
  • Resources: Training requires significant GPU resources; fine-tuning on 1 hour of data took ~4 hours on 4x A100.
  • Links: Paper: https://arxiv.org/abs/2306.07691, Audio samples: https://styletts2.github.io/, Hugging Face demo.
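
The demo's text front-end relies on phonemizer with the espeak-ng backend. A minimal sketch of that preprocessing step is shown below; the synthesis call itself is omitted, since the model API lives in the repository's inference notebooks.

```python
from phonemizer import phonemize

# Convert input text to IPA phonemes with the espeak backend (requires espeak-ng).
text = "StyleTTS 2 synthesizes natural-sounding speech."
phonemes = phonemize(
    text,
    language="en-us",
    backend="espeak",
    strip=True,
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)
```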

Highlighted Details

  • Achieves human-level TTS synthesis, surpassing human recordings on LJSpeech and matching them on VCTK.
  • Outperforms previous models in zero-shot speaker adaptation on LibriTTS.
  • Utilizes style diffusion for style generation without reference speech.
  • Employs pre-trained WavLM as a discriminator for improved naturalness (see the sketch after this list).
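
As a rough illustration of the SLM-discriminator idea, the sketch below scores audio using features from a frozen pre-trained WavLM (via Hugging Face transformers) and a small trainable head. The head and the mean-pooling are hypothetical stand-ins, not the project's actual discriminator architecture.

```python
import torch
from transformers import WavLMModel

# Frozen pre-trained WavLM used as a feature extractor for the discriminator.
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
for p in wavlm.parameters():
    p.requires_grad_(False)

# Hypothetical discriminator head: one "realness" logit per utterance.
head = torch.nn.Linear(wavlm.config.hidden_size, 1)

def slm_score(wav_16khz: torch.Tensor) -> torch.Tensor:
    # wav_16khz: (batch, samples) raw waveform at 16 kHz
    feats = wavlm(wav_16khz).last_hidden_state   # (batch, frames, hidden)
    return head(feats.mean(dim=1))               # mean-pool over frames, then score
```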

Maintenance & Community

The project is maintained by yl4579. Community support and contributions are encouraged, with a specific call for help on DDP issues. The README does not link to community discussions or forums.

Licensing & Compatibility

  • Code: MIT License.
  • Pre-Trained Models: Custom license requiring disclosure that generated speech is synthesized and explicit speaker permission for voice cloning; restrictions apply to voices not drawn from open-access datasets. Inference depends on a GPL-licensed package; an MIT-licensed alternative using gruut is available.

Limitations & Caveats

Distributed Data Parallel (DDP) for the second stage training is not functional, limiting multi-GPU training for this phase. The custom license for pre-trained models may impact commercial use or integration into closed-source projects. High-pitched background noise during inference can occur on older GPUs.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star history: 202 stars in the last 90 days
