radtts by NVIDIA

Flow-based TTS recipes for training, inference, and voice conversion

created 3 years ago
288 stars

Top 92.1% on sourcepulse

Project Summary

NVIDIA/radtts provides RADTTS, a normalizing-flow-based text-to-speech (TTS) framework designed for high acoustic fidelity and robust alignment learning. It supports diverse synthesis and fine-grained control over speech attributes such as fundamental frequency (F0) and energy, and is aimed at researchers and developers working on speech synthesis.

How It Works

RADTTS uses a bipartite normalizing-flow architecture to map text to mel spectrograms. It offers variants that condition on F0 and energy, plus separate models that explicitly predict phoneme duration, F0, and energy from text. A key component is a standalone alignment module that learns text-audio alignment without supervision, which TTS training depends on. The framework also integrates a HiFi-GAN vocoder for high-fidelity audio synthesis.
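
To make the "bipartite flow" idea concrete, here is a minimal sketch of an affine coupling layer, a common building block of bipartite normalizing flows. It is an illustration under stated assumptions, not the repository's implementation: the `conditioner` function is a hypothetical stand-in for a learned network, and the four-value "mel frame" and two-value "text context" are made-up toy data. The point it shows is that one half of the input passes through unchanged and parameterizes an invertible transform of the other half.

```python
import math


def conditioner(x_a, context):
    """Hypothetical stand-in for a learned network.

    It sees only the untouched half (x_a) and the conditioning context,
    and returns one log-scale and one shift per element of the other half.
    """
    s = sum(x_a) + sum(context)
    log_scale = [math.tanh(0.1 * s + 0.05 * i) for i in range(len(x_a))]
    shift = [0.5 * math.sin(s + i) for i in range(len(x_a))]
    return log_scale, shift


def coupling_forward(x, context):
    """Map data x to latent z and return the log-determinant of the Jacobian."""
    if len(x) % 2 != 0:
        raise ValueError("this toy layer expects an even number of features")
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = conditioner(x_a, context)
    z_b = [xb * math.exp(ls) + sh for xb, ls, sh in zip(x_b, log_scale, shift)]
    # The Jacobian is triangular, so its log-determinant is the sum of log-scales.
    return x_a + z_b, sum(log_scale)


def coupling_inverse(z, context):
    """Recover x from z by recomputing the same scale and shift from z_a."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_scale, shift = conditioner(z_a, context)
    x_b = [(zb - sh) * math.exp(-ls) for zb, ls, sh in zip(z_b, log_scale, shift)]
    return z_a + x_b


# Toy data (assumed for illustration only).
mel_frame = [0.3, -1.2, 0.8, 2.0]
text_context = [0.1, 0.4]

z, log_det = coupling_forward(mel_frame, text_context)
recovered = coupling_inverse(z, text_context)
print("latent:", z)
print("recovered:", recovered)
```

Because the inverse is exact, a model built from such layers can be trained by maximum likelihood in one direction and sampled in the other. Real flows stack many layers and alternate which half is transformed.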

Quick Start & Requirements

  • Install dependencies with: pip install -r requirements.txt
  • Requires Python and PyTorch. No specific CUDA version is stated, but a GPU is implied for training.
  • Data preparation involves updating filelists and JSON configs to point to data directories.
  • Training commands are provided for base RADTTS, duration predictor, and RADTTS++ with attribute predictors.
  • Inference and voice conversion demos are available.
  • Official project page and samples: [link not provided in README]
  • Relevant works: [link not provided in README]
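
The data-preparation step above (pointing JSON configs at local data directories) could be scripted roughly as follows. This is a sketch under assumptions: the key path `data_config` / `basedir`, the file names, and the dataset path are hypothetical placeholders, so check the repository's actual config files for the real keys before using anything like this.

```python
import json
import tempfile
from pathlib import Path


def set_nested(config, keys, value):
    """Set config[k0][k1]...[kn] = value, creating intermediate dicts as needed."""
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


def retarget_config(src, dst, data_dir):
    """Copy a JSON config from src to dst with its data directory replaced."""
    config = json.loads(Path(src).read_text(encoding="utf-8"))
    # Hypothetical key path; the real config layout may differ.
    set_nested(config, ["data_config", "basedir"], str(data_dir))
    Path(dst).write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config


# Demonstration on a throwaway config so the sketch runs without the repository.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "config_example.json"
    dst = Path(tmp) / "config_local.json"
    example = {"data_config": {"basedir": "/path/to/data"}, "train_config": {"epochs": 1}}
    src.write_text(json.dumps(example), encoding="utf-8")

    retarget_config(src, dst, "/datasets/my_corpus")
    reloaded = json.loads(dst.read_text(encoding="utf-8"))
    print(reloaded["data_config"]["basedir"])
```

Writing to a separate output file keeps the shipped config untouched, which makes it easier to compare against upstream changes later.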

Highlighted Details

  • Claims state-of-the-art acoustic fidelity.
  • Highly robust audio-transcription alignment module.
  • Generative modeling for low-dimensional speech attributes (F0, energy).
  • Supports voice conversion.
  • Includes pre-trained HiFi-GAN vocoder checkpoints.
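
To illustrate what an audio-transcription alignment produces, the sketch below finds the best monotonic path through a matrix of frame-to-token scores with dynamic programming and converts it into per-token durations. This is a generic textbook technique shown for intuition; the summary does not describe the repository's alignment algorithm, so nothing here should be read as its method. The score matrix is made-up toy data.

```python
def monotonic_align(log_prob):
    """Best monotonic alignment of T frames to N tokens.

    log_prob[t][n] scores frame t against token n. The path starts at the
    first token, ends at the last, and at each frame either stays on the
    current token or advances by exactly one.
    """
    num_frames, num_tokens = len(log_prob), len(log_prob[0])
    if num_frames < num_tokens:
        raise ValueError("need at least one frame per token")

    neg_inf = float("-inf")
    best = [[neg_inf] * num_tokens for _ in range(num_frames)]
    best[0][0] = log_prob[0][0]
    for t in range(1, num_frames):
        # Token n is unreachable before frame n, so skip n > t.
        for n in range(min(t + 1, num_tokens)):
            stay = best[t - 1][n]
            advance = best[t - 1][n - 1] if n > 0 else neg_inf
            best[t][n] = max(stay, advance) + log_prob[t][n]

    # Backtrack from the last frame and last token.
    path = [0] * num_frames
    n = num_tokens - 1
    for t in range(num_frames - 1, -1, -1):
        path[t] = n
        if t > 0 and n > 0 and best[t - 1][n - 1] >= best[t - 1][n]:
            n -= 1

    durations = [path.count(i) for i in range(num_tokens)]
    return path, durations


# Toy scores for 5 frames and 3 tokens (assumed for illustration only).
scores = [
    [-0.1, -3.0, -5.0],
    [-0.2, -2.5, -5.0],
    [-3.0, -0.1, -4.0],
    [-4.0, -0.3, -2.0],
    [-5.0, -3.0, -0.1],
]
path, durations = monotonic_align(scores)
print(path)       # [0, 0, 1, 1, 2]
print(durations)  # [2, 2, 1]
```

The durations sum to the number of frames, which is the property a duration predictor trained on such alignments relies on.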

Maintenance & Community

  • Developed by NVIDIA.
  • No specific community links (Discord/Slack) or roadmap mentioned in the README.

Licensing & Compatibility

  • MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README says more pre-trained RADTTS models are coming soon, which suggests that few are currently available. Hardware requirements (e.g., GPU memory, CUDA version) are not specified, and training models of this kind typically demands significant computational resources.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
