flowtron by NVIDIA

Text-to-speech (TTS) research project built on a flow-based generative network

created 5 years ago
897 stars


Project Summary

Flowtron is an autoregressive, flow-based generative network for text-to-speech (TTS) synthesis, designed for researchers and developers seeking high-quality, expressive speech with controllable variations. It offers style transfer capabilities and aims to match state-of-the-art TTS models in speech quality.

How It Works

Flowtron builds on Tacotron and autoregressive flows to learn an invertible mapping from data (mel-spectrograms) to a latent space. Because the mapping is invertible, points in the latent space can be sampled or edited and then mapped back to speech, which gives control over characteristics such as pitch, tone, speech rate, and accent. The model is trained by maximizing the likelihood of the training data, which keeps training simple and stable.
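The idea of an invertible autoregressive mapping trained by maximum likelihood can be illustrated with a toy sketch. This is not Flowtron's code: the `conditioner` function is a hypothetical stand-in for the neural network, frames are single floats rather than mel-spectrogram vectors, and the constants `a` and `c` are arbitrary. It only shows that each step is an affine transform whose parameters depend on earlier frames, so the mapping can be inverted exactly and its log-determinant is a simple sum.

```python
import math

def conditioner(prev, a=0.5, c=0.3):
    """Hypothetical stand-in for the network: previous frame -> (log_scale, bias)."""
    log_s = math.tanh(a * prev)  # bounded log-scale keeps the transform well behaved
    b = c * prev
    return log_s, b

def forward(x):
    """Map data x to latent z; also return log|det J| of the mapping."""
    z, log_det, prev = [], 0.0, 0.0
    for x_t in x:
        log_s, b = conditioner(prev)
        z.append(math.exp(log_s) * x_t + b)  # affine transform of the current frame
        log_det += log_s                     # triangular Jacobian: sum of log-scales
        prev = x_t
    return z, log_det

def inverse(z):
    """Map latent z back to data, one frame at a time."""
    x, prev = [], 0.0
    for z_t in z:
        log_s, b = conditioner(prev)
        x_t = (z_t - b) * math.exp(-log_s)
        x.append(x_t)
        prev = x_t  # the next step conditions on the frame just reconstructed
    return x

def log_likelihood(x):
    """Exact log-likelihood: standard normal prior on z plus log|det J|."""
    z, log_det = forward(x)
    log_pz = sum(-0.5 * (z_t * z_t + math.log(2 * math.pi)) for z_t in z)
    return log_pz + log_det
```

In a real model the conditioner would have trainable parameters, and training would adjust them to increase `log_likelihood` over the dataset.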

Quick Start & Requirements

  • Install: Clone the repository, initialize its submodules (git submodule update --init), and install the Python requirements (pip install -r requirements.txt).
  • Prerequisites: NVIDIA GPU with CUDA and cuDNN.
  • Resources: Training requires a dataset and a configuration file. The README links to pre-trained models (LJS, LibriTTS).
  • Docs: The project website hosts audio samples.

Highlighted Details

  • Autoregressive flow-based generative network for TTS.
  • Control over speech variation, interpolation, and style transfer.
  • Matches state-of-the-art TTS models in Mean Opinion Scores (MOS).
  • Supports multi-GPU and Automatic Mixed Precision (AMP) training.
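The variation and interpolation controls listed above operate on latent vectors. A minimal sketch follows, assuming a zero-mean Gaussian prior over the latent space (a common choice for flow models). The function names `sample_latent` and `interpolate` are hypothetical and are not part of the Flowtron codebase, and the step that maps a latent vector back to speech is not shown.

```python
import random

def sample_latent(n_frames, sigma, seed=0):
    """Draw a latent sequence from N(0, sigma^2); a larger sigma gives more variation."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n_frames)]

def interpolate(z_a, z_b, alpha):
    """Linear interpolation between two latent sequences (alpha=0 -> z_a, alpha=1 -> z_b)."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]

# Two latent sequences drawn with different seeds, and a midpoint between them.
z_a = sample_latent(4, sigma=0.5, seed=1)
z_b = sample_latent(4, sigma=0.5, seed=2)
z_mid = interpolate(z_a, z_b, 0.5)
```

Passing `z_mid` through the inverse of a trained flow would, under these assumptions, yield speech whose characteristics lie between those of the two endpoints.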

Maintenance & Community

This project is from NVIDIA. No specific community links or roadmap are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. It mentions reusing code from other repositories, so the licenses of those projects may also apply. Suitability for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not detail specific limitations, unsupported platforms, or known bugs. The project appears to be research-oriented, and extensive fine-tuning or specific dataset preparation might be required for optimal performance.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 90 days
