Generative flow for text-to-speech research paper
Glow-TTS is a flow-based generative model for text-to-speech (TTS) that synthesizes mel-spectrograms in parallel without requiring external aligners. It targets researchers and developers seeking fast, diverse, and controllable speech synthesis, offering an order-of-magnitude speed-up over autoregressive models like Tacotron 2 with comparable quality.
How It Works
Glow-TTS combines generative flows with dynamic programming to perform monotonic alignment search internally. The model finds the most probable monotonic alignment between the text and the latent representation of speech on its own, which removes the need for a pre-trained alignment model. Enforcing hard monotonic alignments makes synthesis robust and lets the model generalize to long utterances, while the flow-based architecture enables fast, controllable speech generation.
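The alignment search described above can be illustrated with a small dynamic program. This is a minimal pure-Python sketch, not the project's Cython implementation: the function name monotonic_alignment_search and the input layout (log_p[i][j] as the log-likelihood of latent frame j under text token i) are assumptions made for illustration. It also assumes every text token must cover at least one frame and that the alignment never moves backwards or skips a token.

```python
import math


def monotonic_alignment_search(log_p):
    """Return (alignment, score) for the best monotonic alignment.

    log_p[i][j] is assumed to hold the log-likelihood of latent frame j
    under text token i (hypothetical layout, chosen for this sketch).
    alignment[j] is the index of the text token assigned to frame j.
    """
    n_text = len(log_p)
    n_frames = len(log_p[0])
    if n_frames < n_text:
        raise ValueError("need at least one frame per text token")

    neg_inf = -math.inf
    # q[i][j]: best total log-likelihood of aligning frames 0..j
    # to tokens 0..i, with frame j assigned to token i.
    q = [[neg_inf] * n_frames for _ in range(n_text)]
    q[0][0] = log_p[0][0]

    for j in range(1, n_frames):
        # Token i cannot be reached before frame i, hence the upper bound.
        for i in range(min(j + 1, n_text)):
            stay = q[i][j - 1]  # frame j-1 used the same token
            advance = q[i - 1][j - 1] if i > 0 else neg_inf  # previous token
            q[i][j] = max(stay, advance) + log_p[i][j]

    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frames
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and (i == j or q[i - 1][j - 1] >= q[i][j - 1]):
            i -= 1

    return alignment, q[n_text - 1][n_frames - 1]


if __name__ == "__main__":
    # Two text tokens, four latent frames (toy values).
    toy = [
        [0.0, 0.0, -5.0, -5.0],
        [-5.0, -5.0, 0.0, 0.0],
    ]
    print(monotonic_alignment_search(toy))
```

Counting how many frames map to each token in the returned alignment gives per-token durations. The cost of the search is proportional to the number of text tokens times the number of frames.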
Quick Start & Requirements
Initialize the submodules (git submodule init; git submodule update). Build the Cython monotonic alignment search code (cd monotonic_align; python setup.py build_ext --inplace).
Highlighted Details
Maintenance & Community
The project is associated with authors from Seoul National University. Links to community resources like Discord or Slack are not explicitly provided in the README.
Licensing & Compatibility
The repository does not explicitly state a license. Its dependencies (WaveGlow, Tensor2Tensor, Mellotron) are released under permissive open-source licenses, but that does not settle the terms for Glow-TTS itself. Confirm the license explicitly before commercial use or closed-source linking.
Limitations & Caveats
The project relies on old versions of PyTorch (1.2.0) and Python (3.6.9), which may be difficult to install alongside modern environments. The README also mentions quality-improving modifications (HiFi-GAN, blank tokens) that were not part of the original paper, so the current code may not match the paper's setup exactly.