glow-tts by jaywalnut310

Implementation of the research paper on Glow-TTS, a generative flow for text-to-speech

Created 5 years ago
699 stars

Top 48.8% on SourcePulse

Project Summary

Glow-TTS is a flow-based generative model for text-to-speech (TTS) that synthesizes mel-spectrograms in parallel without requiring external aligners. It targets researchers and developers seeking fast, diverse, and controllable speech synthesis, offering an order-of-magnitude speed-up over autoregressive models like Tacotron 2 with comparable quality.

How It Works

Glow-TTS leverages generative flows and dynamic programming to perform monotonic alignment search internally. This approach allows the model to find the most probable alignment between text and speech latent representations on its own, eliminating the need for pre-trained alignment models. Enforcing hard monotonic alignments contributes to robust TTS, generalizing well to long utterances, while the flow-based architecture enables efficient and controllable speech generation.
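
The dynamic-programming search described above can be illustrated with a small sketch. This is not the repository's Cython implementation; it is a minimal pure-Python version that assumes a precomputed matrix `log_lik[i][j]` holding the log-likelihood of speech frame `j` under text token `i`. The function names `monotonic_alignment_search` and `durations_from_alignment` are hypothetical and chosen for illustration only.

```python
def monotonic_alignment_search(log_lik):
    """Return, for each frame j, the index of the text token it aligns to.

    log_lik[i][j] is assumed to be the log-likelihood of frame j under
    token i. The alignment is monotonic and skips no token: it starts at
    token 0, ends at the last token, and moves forward by at most one
    token per frame.
    """
    n_text = len(log_lik)
    n_frames = len(log_lik[0])
    if n_text > n_frames:
        raise ValueError("need at least as many frames as text tokens")

    neg_inf = float("-inf")
    # q[i][j]: best total log-likelihood of any valid path ending at (i, j).
    q = [[neg_inf] * n_frames for _ in range(n_text)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        # Token i is unreachable before frame i, so only fill i <= j.
        for i in range(min(j + 1, n_text)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else neg_inf
            q[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frames
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return alignment


def durations_from_alignment(alignment, n_text):
    """Count how many frames each token received."""
    durations = [0] * n_text
    for token_index in alignment:
        durations[token_index] += 1
    return durations


# Toy example: 2 tokens, 4 frames; token 0 fits frames 0-1, token 1 fits 2-3.
toy = [[0, 0, -5, -5],
       [-5, -5, 0, 0]]
path = monotonic_alignment_search(toy)
print(path)                               # [0, 0, 1, 1]
print(durations_from_alignment(path, 2))  # [2, 2]
```

The per-token frame counts from such an alignment are the kind of signal a duration predictor can be trained on, which is how a model can avoid depending on an external aligner. The sketch runs in time proportional to the number of tokens times the number of frames.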

Quick Start & Requirements

  • Install: Clone the repository and initialize submodules (git submodule init; git submodule update). Build the Cython monotonic alignment search code (cd monotonic_align; python setup.py build_ext --inplace).
  • Prerequisites: Python 3.6.9, PyTorch 1.2.0, Cython 0.29.12, librosa 0.7.1, NumPy 1.16.4, SciPy 1.3.0. Requires the LJ Speech dataset. Mixed-precision training uses Apex. A WaveGlow model and a HiFi-GAN vocoder (fine-tuned with Tacotron 2) are recommended for improved quality.
  • Resources: No specific hardware requirements are listed. Training is compute-intensive; inference is fast because mel-spectrograms are synthesized in parallel.
  • Links: Demo, Pretrained Model, HiFi-GAN Repo

Highlighted Details

  • Achieves an order-of-magnitude speed-up over Tacotron 2 during synthesis.
  • Eliminates the need for external aligners, a common requirement for parallel TTS models.
  • Easily extendable to a multi-speaker setting.
  • Update notes suggest using HiFi-GAN for reduced noise and adding blank tokens between input tokens for improved pronunciation.
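
The blank-token suggestion in the last item can be sketched as follows. This is a minimal illustration, not necessarily the repository's code: the helper name `intersperse_blank` is hypothetical, the blank ID value is an assumption, and padding the sequence with a blank at both ends (rather than only between tokens) is also an assumption.

```python
def intersperse_blank(token_ids, blank_id):
    """Insert blank_id between every pair of tokens and at both ends.

    Assumption: blank_id is a vocabulary index reserved for the blank
    token and not used by any real input symbol.
    """
    result = [blank_id] * (2 * len(token_ids) + 1)
    result[1::2] = token_ids  # original tokens occupy the odd positions
    return result


print(intersperse_blank([5, 7, 9], 0))  # [0, 5, 0, 7, 0, 9, 0]
```

Because the input sequence roughly doubles in length, a model trained with blanks must also receive them at inference time.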

Maintenance & Community

The project is associated with authors from Seoul National University. Links to community resources like Discord or Slack are not explicitly provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. However, the projects it depends on (WaveGlow, Tensor2Tensor, Mellotron) are typically released under permissive licenses. Confirm the license explicitly before commercial use or closed-source linking.

Limitations & Caveats

The project relies on older versions of PyTorch (1.2.0) and Python (3.6.9), which may pose compatibility challenges with modern environments. The README mentions specific modifications for quality improvement (HiFi-GAN, blank tokens) that were not included in the original paper, so output from the current code may differ from the results reported there.

Health Check

  • Last Commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days
