StableTTS by KdaiP

TTS model using flow-matching and DiT, inspired by Stable Diffusion 3

created 1 year ago
415 stars

Top 71.7% on sourcepulse

Project Summary

StableTTS is a fast, lightweight, multilingual text-to-speech (TTS) model built on flow-matching and a Diffusion Transformer (DiT), inspired by Stable Diffusion 3. It targets high-quality speech generation for Chinese, English, and Japanese with 31 million parameters, which makes it a practical option for researchers and developers who want an efficient, modern TTS model.

How It Works

StableTTS combines flow-matching for efficient sequence generation with a Diffusion Transformer (DiT) decoder. The DiT block incorporates U-Net-like long skip connections and a FiLM layer that conditions the block on the timestep embedding, enhancing prosody control. Sampling uses ODE solvers from torchdiffeq with a cosine timestep scheduler. The design aims for faster inference and better audio quality than traditional diffusion or autoregressive models.
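
The sampling idea described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the project's code: the function names are hypothetical, a fixed-step Euler loop stands in for the torchdiffeq solver, the cosine warp `1 - cos(pi * t / 2)` is one common choice that the repository may or may not use, and the velocity field is a closed-form toy instead of a trained DiT decoder.

```python
import math


def cosine_timesteps(n_steps):
    """Warp a uniform grid on [0, 1] with t -> 1 - cos(pi * t / 2).

    Assumed form of a cosine schedule; it takes smaller steps near t = 0.
    """
    return [1.0 - math.cos(0.5 * math.pi * k / n_steps) for k in range(n_steps + 1)]


def film(features, scale, shift):
    """FiLM: per-channel affine modulation, y[c] = scale[c] * x[c] + shift[c].

    In a real model, scale and shift would be predicted from the timestep embedding.
    """
    return [s * x + b for x, s, b in zip(features, scale, shift)]


def toy_velocity(x, t, target):
    """Exact flow-matching velocity when every straight path ends at one target point.

    Stand-in for the decoder's prediction (hypothetical, for illustration only).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(x0, target, n_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 towards t = 1 with fixed Euler steps."""
    ts = cosine_timesteps(n_steps)
    x = list(x0)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = toy_velocity(x, t_cur, target)  # velocity is evaluated at the step start
        x = [xi + (t_next - t_cur) * vi for xi, vi in zip(x, v)]
    return x
```

Calling `euler_sample([0.0, 0.0], [1.0, -2.0])` returns a point very close to the target, because the toy field points straight at it. With a learned decoder the same loop would move noise towards a mel spectrogram, and the number of steps would trade speed against quality.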

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: PyTorch (v2.4 recommended), Python 3.12. Requires downloading pretrained models for text-to-mel and mel-to-wav (Vocos or FireflyGAN).
  • Resources: Setup involves downloading the pretrained models. Inference and training resource requirements are not stated, but a GPU is expected for practical performance.
  • Links: Huggingface demo, PyTorch installation, Vocoder download, Text-to-Mel download.
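
Before installing, it can help to confirm that the local environment matches the prerequisites listed above. The following is a minimal sketch, not part of the project: `check_environment` is a hypothetical helper, Python 3.12 is treated here as a recommended minimum (an assumption), and `torchdiffeq` is checked only because the summary names it as the ODE solver library.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 12), packages=("torch", "torchdiffeq")):
    """Report whether Python is recent enough and the named packages are importable."""
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing or too old'}")
```

The check only tests importability; it does not verify the PyTorch version or GPU availability.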

Highlighted Details

  • V1.1 improves audio quality through fixes to the mel spectrogram and attention mask.
  • Supports Classifier-Free Guidance (CFG) and integrates the FireflyGAN vocoder.
  • Features a U-Net-like DiT decoder and improved Chinese text frontend.
  • Offers multilingual support (Chinese, English, Japanese) within a single checkpoint.
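
Classifier-Free Guidance, mentioned in the list above, blends a conditional and an unconditional prediction at each sampling step. The sketch below shows one widely used formulation; the function name is hypothetical, and the exact formula and scale convention used in the repository are not confirmed by this summary.

```python
def cfg_velocity(v_cond, v_uncond, guidance_scale):
    """Blend two predictions: v = v_uncond + scale * (v_cond - v_uncond).

    scale = 0 gives the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 pushes further in the direction of the conditioning.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_uncond)]
```

In a flow-matching sampler this would be applied to the predicted velocity at every ODE step, at the cost of two decoder evaluations per step.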

Maintenance & Community

The project is described as actively developed. V1.1, released in September 2024, brought significant quality improvements, and further updates, including a new autoregressive TTS model, are planned. The README does not provide community links.

Licensing & Compatibility

The project is released under the Apache-2.0 license. A disclaimer prohibits using the technology for generating or editing speech without consent, particularly for public figures, citing potential copyright law violations.

Limitations & Caveats

The project's disclaimer highlights ethical concerns and potential legal ramifications regarding unauthorized speech generation. Specific hardware requirements beyond PyTorch compatibility are not detailed, and training resource needs are not quantified.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 90 days
