radtts by NVIDIA

Flow-based TTS recipes for training, inference, and voice conversion

Created 3 years ago

291 stars

Top 90.9% on SourcePulse

Project Summary

NVIDIA/radtts provides a normalizing-flow-based Text-to-Speech (TTS) framework, RADTTS, designed for high acoustic fidelity and robust alignment learning. It enables diverse synthesis and fine-grained control over speech attributes like fundamental frequency (F0) and energy, targeting researchers and developers in speech synthesis.

How It Works

RADTTS employs a normalizing-flow bipartite architecture to map text to mel spectrograms. It offers variants that condition on F0 and energy, and separate models for explicitly modeling text-conditional phoneme duration, F0, and energy. A key component is its standalone alignment module for unsupervised text-audio alignment, crucial for TTS training. The framework also integrates a HiFi-GAN vocoder for high-fidelity audio synthesis.
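To make the "bipartite" idea concrete, here is a minimal sketch of one affine coupling step, the standard building block of bipartite normalizing flows. It is not taken from the RADTTS code base: `toy_conditioner` is a hypothetical stand-in for a learned network, and `context` is an assumed placeholder for text conditioning. One half of the input passes through unchanged and parameterizes an invertible scale-and-shift of the other half, so the step can be undone exactly and its log-determinant is just the sum of the log-scales.

```python
import math
from typing import List, Sequence, Tuple


def toy_conditioner(x_a: Sequence[float],
                    context: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Hypothetical stand-in for a learned network.

    Returns one (log_scale, shift) pair per element of the transformed half.
    """
    total = sum(x_a) + sum(context)
    log_scale = [math.tanh(0.1 * total + 0.05 * i) for i in range(len(x_a))]
    shift = [0.5 * total - i for i in range(len(x_a))]
    return log_scale, shift


def coupling_forward(x: Sequence[float],
                     context: Sequence[float]) -> Tuple[List[float], float]:
    """Keep the first half, apply an affine transform to the second half."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(x) // 2
    x_a, x_b = list(x[:half]), list(x[half:])
    log_scale, shift = toy_conditioner(x_a, context)
    z_b = [b * math.exp(s) + t for b, s, t in zip(x_b, log_scale, shift)]
    log_det = sum(log_scale)  # log |det Jacobian| of the affine transform
    return x_a + z_b, log_det


def coupling_inverse(z: Sequence[float],
                     context: Sequence[float]) -> List[float]:
    """Undo coupling_forward: the untouched half reproduces the parameters."""
    if len(z) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(z) // 2
    z_a, z_b = list(z[:half]), list(z[half:])
    log_scale, shift = toy_conditioner(z_a, context)
    x_b = [(b - t) * math.exp(-s) for b, s, t in zip(z_b, log_scale, shift)]
    return z_a + x_b


x = [0.3, -1.2, 0.7, 2.0]   # toy feature vector
context = [0.1, 0.2]        # assumed placeholder for text conditioning
z, log_det = coupling_forward(x, context)
x_rec = coupling_inverse(z, context)
print(max(abs(a - b) for a, b in zip(x, x_rec)))  # reconstruction error
```

A real flow stacks many such steps and alternates which half is transformed, so that every dimension is eventually updated.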

Quick Start & Requirements

Install dependencies with pip install -r requirements.txt.
Requires Python and PyTorch. The README does not state a CUDA version, though training in practice assumes a GPU.
Data preparation involves updating filelists and JSON configs to point to data directories.
Training commands are provided for base RADTTS, duration predictor, and RADTTS++ with attribute predictors.
Inference and voice conversion demos are available.
Links to the official project page, samples, and relevant works are not provided in the README.
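The data preparation step (pointing JSON configs at data directories) could be scripted along the following lines. This is a sketch only: the key names `data_config`, `training_files`, `validation_files`, and `basedir` are assumptions made for illustration and should be checked against the actual config files in the repository.

```python
import json


def point_config_at_data(config: dict, data_root: str) -> dict:
    """Return a copy of config with every dataset's base directory replaced.

    The key layout used here is an assumption, not the documented schema.
    """
    updated = json.loads(json.dumps(config))  # deep copy via JSON round trip
    data_config = updated.get("data_config", {})
    for split in ("training_files", "validation_files"):
        for dataset in data_config.get(split, {}).values():
            dataset["basedir"] = data_root
    return updated


# Hypothetical config fragment used only to exercise the function.
example_config = {
    "data_config": {
        "training_files": {
            "my_dataset": {"basedir": "old/path", "filelist": "train.txt"}
        },
        "validation_files": {
            "my_dataset": {"basedir": "old/path", "filelist": "val.txt"}
        },
    }
}

new_config = point_config_at_data(example_config, "/data/my_dataset")
print(json.dumps(new_config, indent=2))
```

Returning a modified copy rather than editing in place keeps the original config available for comparison before anything is written back to disk.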

Highlighted Details

Targets state-of-the-art acoustic fidelity.
Alignment module designed for robust audio-transcription alignment.
Generative modeling for low-dimensional speech attributes (F0, energy).
Supports voice conversion.
Includes pre-trained HiFi-GAN vocoder checkpoints.
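F0 and energy are low-dimensional, per-frame attributes, which is what makes fine-grained control practical. As a rough illustration of the kind of edit this allows (not the RADTTS API), the sketch below shifts an F0 contour by a number of semitones. It assumes the common convention that unvoiced frames are marked with 0.0 and should be left untouched; the function name `shift_f0` is hypothetical.

```python
from typing import List, Sequence


def shift_f0(f0_hz: Sequence[float], semitones: float) -> List[float]:
    """Shift voiced frames by the given number of semitones.

    Frames with a value of 0.0 or below are treated as unvoiced (an assumed
    convention) and are returned unchanged.
    """
    ratio = 2.0 ** (semitones / 12.0)  # one semitone is a factor of 2^(1/12)
    return [f * ratio if f > 0.0 else f for f in f0_hz]


contour = [0.0, 220.0, 440.0, 0.0]   # toy contour in Hz, 0.0 = unvoiced
raised = shift_f0(contour, 12.0)     # raise by one octave
print(raised)
```

In a model that conditions on F0, a modified contour like `raised` would be supplied in place of the predicted one at synthesis time.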

Maintenance & Community

Developed by NVIDIA.
No specific community links (Discord/Slack) or roadmap mentioned in the README.

Licensing & Compatibility

MIT License.
Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README states that more pre-trained RADTTS models are coming soon, so the current selection may be limited. Hardware requirements such as GPU memory and CUDA version are not documented, and training models of this kind typically demands substantial compute.
