StableTTS by KdaiP

TTS model using flow-matching and DiT, inspired by Stable Diffusion 3

Created 1 year ago

434 stars

Top 68.6% on SourcePulse

Project Summary

StableTTS is a fast, lightweight, multilingual text-to-speech (TTS) model built on flow-matching and a Diffusion Transformer (DiT), inspired by Stable Diffusion 3. It targets high-quality speech generation for Chinese, English, and Japanese with only 31 million parameters, which makes it a fit for researchers and developers who want an efficient, modern TTS model.

How It Works

StableTTS pairs a flow-matching objective with a Diffusion Transformer (DiT) decoder. The decoder uses U-Net-like long skip connections and a FiLM layer that conditions the DiT blocks on the timestep embedding, a design intended to improve prosody. At inference, an ODE solver from torchdiffeq integrates the learned flow along a cosine timestep schedule, aiming for faster inference and better audio quality than traditional diffusion or autoregressive models.
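The sampling idea described above can be illustrated with a minimal, dependency-free sketch. It is not the project's code: the learned DiT is replaced by a hypothetical `toy_vector_field` whose velocity is known in closed form, fixed-step Euler integration stands in for the torchdiffeq solver, and the schedule `t = 1 - cos(pi/2 * s)` is an assumed form of the cosine timestep scheduler.

```python
import math


def cosine_timesteps(n_steps):
    """Warp a uniform grid s in [0, 1] through t = 1 - cos(pi/2 * s).

    Assumed form of a cosine schedule: steps are smaller near t = 0
    and larger near t = 1.
    """
    return [1.0 - math.cos(0.5 * math.pi * i / n_steps) for i in range(n_steps + 1)]


def toy_vector_field(x, t, target):
    """Stand-in for the learned decoder (hypothetical).

    Returns the exact velocity of a straight-line path from the current
    point to `target`: v = (target - x) / (1 - t).
    """
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]


def euler_sample(x0, target, n_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    ts = cosine_timesteps(n_steps)
    x = list(x0)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = toy_vector_field(x, t_cur, target)
        dt = t_next - t_cur
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.3, -1.2, 0.8]      # starting point, playing the role of Gaussian noise
mel_frame = [1.0, 2.0, -0.5]  # toy "mel" values the flow should reach
sample = euler_sample(noise, mel_frame)
```

Because the toy velocity field describes a straight line, Euler integration lands on `mel_frame` up to floating-point error for any step count. With a learned network the field is only approximate, which is where the choice of solver and timestep schedule starts to matter.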

Quick Start & Requirements

Install: pip install -r requirements.txt
Prerequisites: PyTorch (v2.4 recommended), Python 3.12. Requires downloading pretrained models for text-to-mel and mel-to-wav (Vocos or FireflyGAN).
Resources: Setup requires downloading the pretrained models. Inference and training resource requirements are not documented; a GPU is likely needed for practical speed.
Links: Huggingface demo, PyTorch installation, Vocoder download, Text-to-Mel download.
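Before downloading checkpoints, it can help to confirm that the environment matches the prerequisites listed above. The following is a small sketch using only the standard library; the helper name `check_prerequisites` is hypothetical, and treating Python 3.12 as a minimum version (rather than an exact requirement) is an assumption.

```python
import importlib.util
import sys


def check_prerequisites(min_python=(3, 12)):
    """Return a list of human-readable problems; an empty list means OK.

    Assumption: the stated Python version is treated as a minimum.
    """
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        found = ".".join(str(part) for part in sys.version_info[:2])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ expected, found {found}")
    # find_spec only checks that the package is importable; it does not load it.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    return problems


for problem in check_prerequisites():
    print("WARNING:", problem)
```

The check does not verify the PyTorch version or GPU availability, and it says nothing about whether the pretrained models have been downloaded.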

Highlighted Details

Improves audio quality in V1.1 through fixes to the mel spectrogram and the attention mask.
Supports Classifier-Free Guidance (CFG) and integrates the FireflyGAN vocoder.
Features a U-Net-like DiT decoder and improved Chinese text frontend.
Offers multilingual support (Chinese, English, Japanese) within a single checkpoint.
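To make the Classifier-Free Guidance item above concrete, here is a minimal sketch of the usual guidance rule applied to a predicted velocity. It is a generic illustration, not the project's implementation: the function name `cfg_velocity` is hypothetical, and the scale convention (a scale of 1 reproduces the conditional prediction) is an assumption, since some implementations parameterize the scale differently.

```python
def cfg_velocity(v_cond, v_uncond, guidance_scale):
    """Blend conditional and unconditional predictions.

    Computes v = v_uncond + scale * (v_cond - v_uncond):
      scale = 0 -> unconditional prediction
      scale = 1 -> conditional prediction
      scale > 1 -> extrapolates past the conditional prediction
    """
    return [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


v_cond = [1.0, 2.0]    # toy prediction with text/speaker conditioning
v_uncond = [0.0, 1.0]  # toy prediction with the conditioning dropped
guided = cfg_velocity(v_cond, v_uncond, guidance_scale=3.0)
```

In a sampler, this blend would replace the single model call at each ODE step, at the cost of two forward passes per step.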

Maintenance & Community

The project is actively developed: V1.1, released in September 2024, brought significant quality improvements, and further updates are planned, including a new autoregressive TTS model. The README does not provide community links.

Licensing & Compatibility

The project is released under the Apache-2.0 license. A disclaimer prohibits using the technology to generate or edit someone's speech without their consent, particularly for public figures, and warns that doing so may violate copyright law.

Limitations & Caveats

The project's disclaimer highlights ethical concerns and potential legal ramifications regarding unauthorized speech generation. Specific hardware requirements beyond PyTorch compatibility are not detailed, and training resource needs are not quantified.
