glow-tts by jaywalnut310

Research code for a generative flow for text-to-speech

Created 5 years ago

702 stars

Top 48.6% on SourcePulse

Project Summary

Glow-TTS is a flow-based generative model for text-to-speech (TTS) that synthesizes mel-spectrograms in parallel without requiring external aligners. It targets researchers and developers seeking fast, diverse, and controllable speech synthesis, and offers an order-of-magnitude synthesis speed-up over autoregressive models such as Tacotron 2 at comparable quality.

How It Works

Glow-TTS leverages generative flows and dynamic programming to perform monotonic alignment search internally. This approach allows the model to find the most probable alignment between text and speech latent representations on its own, eliminating the need for pre-trained alignment models. Enforcing hard monotonic alignments contributes to robust TTS, generalizing well to long utterances, while the flow-based architecture enables efficient and controllable speech generation.
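
The dynamic-programming step described above can be illustrated with a small pure-Python sketch. It is not the repository's Cython code: the function name `monotonic_alignment_search` and the input layout (`log_p[i][j]` holding the log-likelihood of speech frame `j` under text token `i`) are assumptions made for illustration. The sketch assumes every token covers at least one frame and that the alignment never skips a token or moves backwards.

```python
NEG_INF = float("-inf")

def monotonic_alignment_search(log_p):
    """Return, for each frame j, the index of the token it is aligned to.

    log_p[i][j] is assumed to be the log-likelihood of frame j under token i.
    """
    n_text, n_frames = len(log_p), len(log_p[0])
    if n_frames < n_text:
        raise ValueError("need at least one frame per token")

    # q[i][j]: best total log-likelihood of any monotonic path ending at (i, j).
    q = [[NEG_INF] * n_frames for _ in range(n_text)]
    q[0][0] = log_p[0][0]
    for j in range(1, n_frames):
        # Token i cannot be reached before frame i, so cap i at j.
        for i in range(min(j + 1, n_text)):
            stay = q[i][j - 1]                             # same token as previous frame
            advance = q[i - 1][j - 1] if i > 0 else NEG_INF  # move on to the next token
            q[i][j] = max(stay, advance) + log_p[i][j]

    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frames
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return alignment

# Toy input: token 0 fits the first two frames, token 1 the last two.
toy = [[0.0, 0.0, -5.0, -5.0],
       [-5.0, -5.0, 0.0, 0.0]]
print(monotonic_alignment_search(toy))  # [0, 0, 1, 1]
```

The table has one cell per (token, frame) pair, so the search costs time proportional to the product of the text and speech lengths.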

Quick Start & Requirements

Install: Clone the repository and initialize submodules (git submodule init; git submodule update). Build the Cython monotonic alignment search code (cd monotonic_align; python setup.py build_ext --inplace).
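Prerequisites: Python 3.6.9, PyTorch 1.2.0, Cython 0.29.12, librosa 0.7.1, NumPy 1.16.4, SciPy 1.3.0. Requires the LJ Speech dataset. Mixed-precision training uses Apex. A vocoder is needed to turn mel-spectrograms into audio: either a pretrained WaveGlow model or a HiFi-GAN vocoder fine-tuned with Tacotron 2, which the update notes recommend for reduced noise.
Resources: Training is computationally demanding; no specific hardware requirements are stated. Inference is fast, roughly an order of magnitude faster than Tacotron 2.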
Links: Demo, Pretrained Model, HiFi-GAN Repo
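
Because the prerequisites pin exact versions, it can help to compare an environment against them before building. The following is a minimal sketch, not part of the repository: the helper names `installed_versions` and `find_mismatches` are hypothetical, and it assumes each package exposes a `__version__` attribute under the module name shown. The comparison is exact, so a build-suffixed version string (for example a CUDA-tagged PyTorch build) is reported as a mismatch.

```python
import importlib

# Versions listed in the prerequisites above, keyed by importable module name.
PINNED = {
    "torch": "1.2.0",
    "Cython": "0.29.12",
    "librosa": "0.7.1",
    "numpy": "1.16.4",
    "scipy": "1.3.0",
}

def installed_versions(names):
    """Map each module name to its __version__, or None if it cannot be imported."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            found[name] = getattr(module, "__version__", None)
    return found

def find_mismatches(installed, pinned=PINNED):
    """Return {name: (installed_version, pinned_version)} for every difference."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in pinned.items()
        if installed.get(name) != wanted
    }

if __name__ == "__main__":
    for name, (have, want) in sorted(find_mismatches(installed_versions(PINNED)).items()):
        print(f"{name}: installed {have}, pinned {want}")
```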

Highlighted Details

Achieves an order-of-magnitude speed-up over Tacotron 2 during synthesis.
Eliminates the need for external aligners, a common requirement for parallel TTS models.
Easily extendable to a multi-speaker setting.
Update notes suggest using HiFi-GAN for reduced noise and adding blank tokens between input tokens for improved pronunciation.
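
The blank-token modification mentioned in the update notes can be sketched as follows. This is an illustrative example under stated assumptions: the function name `intersperse` and the choice of `0` as the blank token id are not confirmed by this summary, and the sketch places a blank before, between, and after the input tokens.

```python
def intersperse(token_ids, blank_id):
    """Return token_ids with blank_id inserted before, between, and after the tokens."""
    result = [blank_id] * (2 * len(token_ids) + 1)
    result[1::2] = token_ids  # original tokens occupy the odd positions
    return result

# Hypothetical token ids for a three-symbol input, with 0 assumed to be the blank id.
print(intersperse([5, 3, 8], 0))  # [0, 5, 0, 3, 0, 8, 0]
```

The sequence length grows from n to 2n + 1, so any text-length tensor passed to the model has to be computed after the blanks are inserted.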

Maintenance & Community

The project is associated with authors from Kakao Enterprise and Seoul National University. The README does not link to community channels such as Discord or Slack.

Licensing & Compatibility

The repository does not explicitly state a license. The projects it builds on are released under permissive licenses (WaveGlow and Mellotron under BSD 3-Clause, Tensor2Tensor under Apache 2.0), but that does not settle the terms for this code. Commercial use or closed-source linking would require explicit license confirmation.

Limitations & Caveats

The project pins old versions of PyTorch (1.2.0) and Python (3.6.9), which may cause compatibility problems in modern environments. The README also recommends modifications that were not part of the original paper (a HiFi-GAN vocoder and blank tokens between input tokens), so the best-quality setup differs from the configuration the paper describes.

Explore Similar Projects

pheme by PolyAI-LDN

Meta-voicebox by SpeechifyInc

PortaSpeech by keonlee9420

GenerSpeech by Rongjiehuang

DiffGAN-TTS by keonlee9420

FastDiff by Rongjiehuang

StyleTTS by yl4579

speech-synthesis-paper by wenet-e2e

flowtron by NVIDIA

ParallelWaveGAN by kan-bayashi

metavoice-src by metavoiceio

vits by jaywalnut310