StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis

created 2 years ago
5,869 stars

Top 8.9% on sourcepulse

View on GitHub
Project Summary

StyleTTS 2 is a text-to-speech (TTS) system designed to achieve human-level speech synthesis through a novel approach combining style diffusion and adversarial training with large speech language models (SLMs). It targets researchers and developers seeking state-of-the-art TTS capabilities, offering improved naturalness and zero-shot speaker adaptation.

How It Works

StyleTTS 2 models speech styles as latent variables using diffusion models, enabling the generation of appropriate styles for text without requiring reference audio. This latent diffusion approach leverages the diversity of diffusion models for efficient and high-quality synthesis. The system further enhances speech naturalness by employing large pre-trained SLMs (like WavLM) as discriminators and incorporating a novel differentiable duration modeling technique for end-to-end training.
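
To make the style-diffusion idea concrete, below is a minimal sketch of ancestral DDPM sampling in a style latent space, assuming a hypothetical `style_denoiser` network that predicts noise from the noisy style vector, the timestep, and a text embedding. This illustrates the general technique only; it is not the project's actual sampler or noise schedule.

```python
import torch

def sample_style(style_denoiser, text_emb, style_dim=128, steps=50, device="cpu"):
    # Linear beta schedule for illustration; the real model uses its own schedule.
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure Gaussian noise in the style latent space.
    x = torch.randn(1, style_dim, device=device)
    for t in reversed(range(steps)):
        # Hypothetical denoiser: predicts the noise component given the text embedding.
        eps = style_denoiser(x, torch.tensor([t], device=device), text_emb)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # a style vector sampled for the text, with no reference audio required
```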

Quick Start & Requirements

  • Install: pip install -r requirements.txt; on Windows with CUDA, additionally install matching torch, torchvision, and torchaudio builds. The demo also needs phonemizer and espeak-ng (see the front-end sketch after this list).
  • Prerequisites: Python >= 3.7, CUDA >= 11.8 (recommended for Windows), espeak-ng.
  • Data: LJSpeech or LibriTTS datasets required.
  • Resources: Training requires significant GPU resources; fine-tuning on 1 hour of data took ~4 hours on 4x A100.
  • Links: Paper: https://arxiv.org/abs/2306.07691, Audio samples: https://styletts2.github.io/, Hugging Face demo.
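
The demo's text front-end relies on phonemizer with the espeak-ng backend. A minimal sketch of that preprocessing step is shown below; the synthesis call itself is omitted, since the model API lives in the repository's inference notebooks.

```python
from phonemizer import phonemize

# Convert input text to IPA phonemes with the espeak backend (requires espeak-ng).
text = "StyleTTS 2 synthesizes natural-sounding speech."
phonemes = phonemize(
    text,
    language="en-us",
    backend="espeak",
    strip=True,
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)
```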

Highlighted Details

  • Achieves human-level TTS synthesis, surpassing human recordings on LJSpeech and matching them on VCTK.
  • Outperforms previous models in zero-shot speaker adaptation on LibriTTS.
  • Utilizes style diffusion for style generation without reference speech.
  • Employs pre-trained WavLM as a discriminator for improved naturalness (see the sketch after this list).
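
As a rough illustration of the SLM-discriminator idea, the sketch below scores audio using features from a frozen pre-trained WavLM (via Hugging Face transformers) and a small trainable head. The head and the mean-pooling are hypothetical stand-ins, not the project's actual discriminator architecture.

```python
import torch
from transformers import WavLMModel

# Frozen pre-trained WavLM used as a feature extractor for the discriminator.
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
for p in wavlm.parameters():
    p.requires_grad_(False)

# Hypothetical discriminator head: one "realness" logit per utterance.
head = torch.nn.Linear(wavlm.config.hidden_size, 1)

def slm_score(wav_16khz: torch.Tensor) -> torch.Tensor:
    # wav_16khz: (batch, samples) raw waveform at 16 kHz
    feats = wavlm(wav_16khz).last_hidden_state   # (batch, frames, hidden)
    return head(feats.mean(dim=1))               # mean-pool over frames, then score
```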

Maintenance & Community

The project is maintained by yl4579. Community support and contributions are encouraged, with a specific call for help on DDP issues. The README does not link to community discussions or forums.

Licensing & Compatibility

  • Code: MIT License.
  • Pre-Trained Models: Custom license requiring disclosure that generated speech is synthesized and explicit speaker permission for voice cloning; restrictions apply to voices not drawn from open-access datasets. Inference depends on a GPL-licensed package; an MIT-licensed alternative using gruut is available.

Limitations & Caveats

Distributed Data Parallel (DDP) for the second stage training is not functional, limiting multi-GPU training for this phase. The custom license for pre-trained models may impact commercial use or integration into closed-source projects. High-pitched background noise during inference can occur on older GPUs.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star history: 202 stars in the last 90 days
