Text-to-speech model achieving human-level synthesis
StyleTTS 2 is a text-to-speech (TTS) system designed to achieve human-level speech synthesis through a novel approach combining style diffusion and adversarial training with large speech language models (SLMs). It targets researchers and developers seeking state-of-the-art TTS capabilities, offering improved naturalness and zero-shot speaker adaptation.
How It Works
StyleTTS 2 models speech styles as a latent random variable sampled with diffusion models, so it can generate a style suited to the input text without requiring reference audio. Because diffusion runs over a compact style vector rather than the waveform, synthesis retains the sample diversity of diffusion models while staying efficient. The system further improves naturalness by employing large pre-trained SLMs (such as WavLM) as discriminators, together with a novel differentiable duration modeling technique that makes this adversarial training end-to-end.
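To make the mechanism concrete, here is a minimal, hypothetical sketch of style diffusion: a small denoiser predicts the noise added to a fixed-size style vector, conditioned on a text embedding, and a standard DDPM reverse loop samples a style from pure noise. None of the module names, dimensions, or the sampler below come from the StyleTTS 2 codebase; the real model uses its own architecture and sampling scheme.

```python
# Hypothetical sketch of style diffusion: the denoiser predicts the noise
# added to a compact style vector, conditioned on a text embedding.
import torch
import torch.nn as nn

STYLE_DIM, TEXT_DIM, T = 128, 512, 50  # assumed sizes and step count

class StyleDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STYLE_DIM + TEXT_DIM + 1, 256), nn.SiLU(),
            nn.Linear(256, STYLE_DIM),
        )

    def forward(self, noisy_style, text_emb, t):
        # Timestep enters as one normalized scalar feature per batch element.
        t_feat = t.float().unsqueeze(-1) / T
        return self.net(torch.cat([noisy_style, text_emb, t_feat], dim=-1))

@torch.no_grad()
def sample_style(denoiser, text_emb, steps=T):
    """Reverse diffusion from Gaussian noise to a conditioned style vector."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    s = torch.randn(text_emb.size(0), STYLE_DIM)  # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(s, text_emb, torch.full((text_emb.size(0),), t))
        # Standard DDPM posterior mean; noise is re-added except at step 0.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        s = (s - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            s = s + torch.sqrt(betas[t]) * torch.randn_like(s)
    return s  # style vector generated without any reference audio

denoiser = StyleDenoiser()
style = sample_style(denoiser, torch.randn(2, TEXT_DIM))
print(style.shape)  # torch.Size([2, 128])
```

Because the diffusion operates on a small vector rather than audio frames, sampling stays cheap relative to waveform-level diffusion.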
Quick Start & Requirements
```
pip install -r requirements.txt
```

On Windows with CUDA, torch, torchvision, and torchaudio must be installed separately; running the demo additionally requires phonemizer and espeak-ng.
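For the Windows-with-CUDA case, the extra PyTorch packages typically come from the official wheel index. A sketch, where the cu118 tag is only an example to be matched to your CUDA runtime:

```bash
# Windows + CUDA: install PyTorch wheels matching your CUDA version
# (cu118 here is an example tag; adjust to your setup).
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Demo dependency: phonemizer needs the espeak-ng backend installed system-wide.
pip install phonemizer
```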
Highlighted Details
Listener evaluations reported in the paper show StyleTTS 2 surpassing human recordings on the single-speaker LJSpeech benchmark, matching ground truth on the multi-speaker VCTK dataset, and outperforming previous publicly available zero-shot TTS models when trained on LibriTTS.
Maintenance & Community
The project is maintained by yl4579. Community contributions are encouraged, with a specific call for help on the DDP issue. The README does not link to dedicated community discussion channels or forums.
Licensing & Compatibility
The code is released under the MIT license. The pre-trained models carry an additional usage condition: listeners must be informed that the speech is synthesized, unless you have permission to use the voice in question.
Limitations & Caveats
Distributed Data Parallel (DDP) is not functional for second-stage training, limiting that phase to a single GPU (see the sketch below). The custom license terms on the pre-trained models may restrict commercial use or integration into closed-source projects. On older GPUs, inference can produce high-pitched background noise.
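As a sketch of the resulting workflow, based on the upstream README's training entry points (verify script names and flags against the current repository), the first stage can use accelerate for multi-GPU training while the second stage is pinned to one GPU:

```bash
# First-stage training supports multi-GPU via accelerate; second-stage DDP is
# broken upstream, so run that phase on a single GPU.
accelerate launch train_first.py --config_path ./Configs/config.yml
CUDA_VISIBLE_DEVICES=0 python train_second.py --config_path ./Configs/config.yml
```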