TTS model using flow-matching and DiT, inspired by Stable Diffusion 3
StableTTS is a fast, lightweight, and multilingual text-to-speech (TTS) model that leverages flow-matching and Diffusion Transformer (DiT) architectures, inspired by Stable Diffusion 3. It aims to provide high-quality speech generation for Chinese, English, and Japanese with a manageable 31 million parameters, making it suitable for researchers and developers seeking efficient and advanced TTS capabilities.
How It Works
StableTTS combines flow-matching for efficient sequence generation with a Diffusion Transformer (DiT) decoder. Each DiT block incorporates U-Net-like long skip connections and a FiLM layer that conditions the block on the timestep embedding, which is intended to improve prosody control. At inference time the model integrates the learned flow with ODE solvers from torchdiffeq on a cosine timestep schedule, aiming for faster inference and better audio quality than traditional diffusion or autoregressive models.
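To make the sampling idea concrete, here is a minimal sketch of flow-matching inference with a fixed-step Euler solver on a cosine-spaced time grid. It is not StableTTS's code: the function names, the toy velocity field (which stands in for the trained DiT decoder), and the exact schedule `t = 1 - cos(pi/2 * u)` are all assumptions chosen for illustration, and plain Python lists replace tensors so the example runs without PyTorch.

```python
import math

def cosine_timesteps(n_steps):
    """Return n_steps + 1 times in [0, 1], spaced as t = 1 - cos(pi/2 * u).

    Assumed schedule for illustration: steps are small near t = 0
    and grow toward t = 1.
    """
    ts = [1.0 - math.cos(0.5 * math.pi * k / n_steps) for k in range(n_steps + 1)]
    ts[0], ts[-1] = 0.0, 1.0  # pin the endpoints against floating-point rounding
    return ts

def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned velocity field.

    For a single target point and straight-line paths
    x_t = (1 - t) * x_0 + t * x_1, the ideal velocity is (x_1 - x) / (1 - t).
    """
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]

def euler_sample(x0, target, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data) with Euler steps."""
    x = list(x0)
    ts = cosine_timesteps(n_steps)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = toy_velocity(x, t_cur, target)  # velocity evaluated at the current time
        dt = t_next - t_cur                 # non-uniform step from the cosine grid
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.3, -1.2, 0.8]    # plays the role of the initial Gaussian sample
target = [1.0, 2.0, -0.5]   # plays the role of the clean mel-spectrogram frame
sample = euler_sample(noise, target, n_steps=10)
```

In a PyTorch setting, the explicit loop would typically be replaced by a call such as `torchdiffeq.odeint(func, y0, t, method="euler")`, with `func` wrapping the trained decoder and `t` holding the scheduled time grid. Fewer steps trade quality for speed, which is where the choice of schedule matters.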
Quick Start & Requirements
pip install -r requirements.txt
Highlighted Details
Maintenance & Community
The project is actively developed, with V1.1 released in September 2024 addressing significant quality improvements. Further updates, including a new autoregressive TTS model, are planned. Community links are not explicitly provided in the README.
Licensing & Compatibility
The project is released under the Apache-2.0 license. A disclaimer prohibits using the technology for generating or editing speech without consent, particularly for public figures, citing potential copyright law violations.
Limitations & Caveats
The project's disclaimer highlights ethical concerns and potential legal ramifications regarding unauthorized speech generation. Specific hardware requirements beyond PyTorch compatibility are not detailed, and training resource needs are not quantified.