glow-tts  by jaywalnut310

Generative flow for text-to-speech research paper

created 5 years ago
695 stars

Project Summary

Glow-TTS is a flow-based generative model for text-to-speech (TTS) that synthesizes mel-spectrograms in parallel without requiring external aligners. It targets researchers and developers seeking fast, diverse, and controllable speech synthesis, offering an order-of-magnitude speed-up over autoregressive models like Tacotron 2 with comparable quality.

How It Works

Glow-TTS leverages generative flows and dynamic programming to perform monotonic alignment search internally. This approach allows the model to find the most probable alignment between text and speech latent representations on its own, eliminating the need for pre-trained alignment models. Enforcing hard monotonic alignments contributes to robust TTS, generalizing well to long utterances, while the flow-based architecture enables efficient and controllable speech generation.
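The dynamic-programming step can be illustrated with a small NumPy sketch. This is not the repository's Cython implementation; the function name `monotonic_alignment_search` and the input `log_p` (a matrix whose entry `[i, j]` is the log-likelihood of mel frame `j` under text token `i`) are assumptions made for illustration. The sketch assumes every frame maps to exactly one token, the token index never decreases, and no token is skipped.

```python
import numpy as np


def monotonic_alignment_search(log_p):
    """Return, for each mel frame, the index of the text token it aligns to.

    log_p[i, j] is an assumed log-likelihood of frame j under token i.
    """
    t_text, t_mel = log_p.shape
    assert t_text <= t_mel, "need at least one frame per token"

    # q[i, j]: best total log-likelihood of any monotonic path ending at (i, j).
    q = np.full((t_text, t_mel), -np.inf)
    q[0, 0] = log_p[0, 0]
    for j in range(1, t_mel):
        for i in range(min(j + 1, t_text)):
            stay = q[i, j - 1]  # same token as the previous frame
            advance = q[i - 1, j - 1] if i > 0 else -np.inf  # next token
            q[i, j] = max(stay, advance) + log_p[i, j]

    # Backtrack from the last token at the last frame.
    alignment = np.zeros(t_mel, dtype=np.int64)
    i = t_text - 1
    for j in range(t_mel - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and q[i - 1, j - 1] > q[i, j - 1]:
            i -= 1
    return alignment
```

The two nested loops cost O(T_text × T_mel) per utterance, and the search is only needed during training; at inference time a model of this kind would predict token durations instead.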

Quick Start & Requirements

  • Install: Clone the repository and initialize submodules (git submodule init; git submodule update). Build the Cython monotonic alignment search code (cd monotonic_align; python setup.py build_ext --inplace).
  • Prerequisites: Python 3.6.9, PyTorch 1.2.0, Cython 0.29.12, librosa 0.7.1, NumPy 1.16.4, SciPy 1.3.0, and the LJ Speech dataset. Mixed-precision training uses Apex. Waveform synthesis uses a WaveGlow model; a HiFi-GAN vocoder fine-tuned with Tacotron 2 is recommended for improved quality.
  • Resources: Training requires significant computational resources. Inference is fast.
  • Links: Demo, Pretrained Model, HiFi-GAN Repo

Highlighted Details

  • Achieves an order-of-magnitude speed-up over Tacotron 2 during synthesis.
  • Eliminates the need for external aligners, a common requirement for parallel TTS models.
  • Easily extendable to a multi-speaker setting.
  • Update notes suggest using HiFi-GAN for reduced noise and adding blank tokens between input tokens for improved pronunciation.
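The blank-token suggestion in the last item can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `intersperse` and the `blank_id` value are assumptions, and the repository may implement the step differently.

```python
def intersperse(tokens, blank_id):
    """Place blank_id before, between, and after every token id."""
    result = [blank_id] * (len(tokens) * 2 + 1)
    result[1::2] = tokens  # original tokens land on the odd positions
    return result


# Hypothetical token ids for a three-symbol input, with 0 as the blank id.
padded = intersperse([5, 3, 8], blank_id=0)
```

A sequence of n tokens becomes 2n + 1 tokens, so the text encoder sees roughly twice as many input positions.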

Maintenance & Community

The project is associated with authors from Seoul National University. Links to community resources like Discord or Slack are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. The projects it draws on (WaveGlow, Tensor2Tensor, Mellotron) are released under permissive open-source licenses, but that does not settle the terms for this code. Confirm the license in the repository before commercial use or closed-source linking.

Limitations & Caveats

The project pins old versions of PyTorch (1.2.0) and Python (3.6.9), which may be difficult to install or run in modern environments. The quality improvements mentioned in the README (HiFi-GAN, blank tokens) were not part of the original paper, so they are later refinements rather than part of the published method.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 90 days
