Flow-based TTS recipes for training, inference, and voice conversion
NVIDIA/radtts provides a normalizing-flow-based Text-to-Speech (TTS) framework, RADTTS, designed for high acoustic fidelity and robust alignment learning. It enables diverse synthesis and fine-grained control over speech attributes like fundamental frequency (F0) and energy, targeting researchers and developers in speech synthesis.
How It Works
RADTTS employs a bipartite normalizing-flow architecture to map text to mel-spectrograms. It offers decoder variants conditioned on F0 and energy, along with separate predictors that explicitly model text-conditional phoneme duration, F0, and energy. A key component is its standalone alignment module, which learns text-audio alignments without external supervision and is crucial for TTS training. The framework also integrates a HiFi-GAN vocoder for high-fidelity waveform synthesis.
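To make the bipartite-flow idea concrete, the sketch below implements a single affine-coupling step in PyTorch: one half of each mel frame parameterizes an invertible affine transform of the other half, so training can maximize exact likelihood and synthesis simply runs the inverse. This is an illustrative sketch only, not code from this repository; the class and dimension names are assumptions, and the real model additionally conditions each step on text (and, in some variants, F0 and energy).

```python
# Minimal sketch of one bipartite (affine coupling) flow step -- illustrative
# only; names and sizes are hypothetical, not taken from NVIDIA/radtts.
import torch
import torch.nn as nn

class CouplingStep(nn.Module):
    def __init__(self, n_mel: int, n_hidden: int = 256):
        super().__init__()
        self.half = n_mel // 2
        # Small net predicts a per-channel scale and shift for the second half
        # of the mel channels from the first half (conditioning omitted here).
        self.net = nn.Sequential(
            nn.Linear(self.half, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 2 * (n_mel - self.half)),
        )

    def forward(self, x):
        # x: (batch, time, n_mel) mel-spectrogram frames
        xa, xb = x[..., : self.half], x[..., self.half:]
        log_s, t = self.net(xa).chunk(2, dim=-1)
        zb = xb * torch.exp(log_s) + t           # affine transform of one half
        log_det = log_s.sum(dim=(-1, -2))        # log-determinant for the flow loss
        return torch.cat([xa, zb], dim=-1), log_det

    def inverse(self, z):
        za, zb = z[..., : self.half], z[..., self.half:]
        log_s, t = self.net(za).chunk(2, dim=-1)
        xb = (zb - t) * torch.exp(-log_s)        # exact inverse used at synthesis time
        return torch.cat([za, xb], dim=-1)

x = torch.randn(2, 100, 80)                      # 2 utterances, 100 frames, 80 mel bins
step = CouplingStep(n_mel=80)
z, log_det = step(x)
assert torch.allclose(step.inverse(z), x, atol=1e-5)
```

A full decoder stacks many such invertible steps (Glow-style flows typically interleave them with invertible 1x1 convolutions), so training maximizes the exact log-likelihood of mel frames while inference draws a Gaussian latent and inverts the stack.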
Quick Start & Requirements
pip install -r requirements.txt
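Beyond installing dependencies, the repository drives training and synthesis through Python entry points and JSON configs. The commands below are a hedged sketch of a typical workflow; the script names, flags, checkpoint names, and config paths are assumptions for illustration, so check the repository README for the exact invocations and links to pre-trained models.

```bash
# Illustrative workflow only -- file names and flags are assumptions;
# see the NVIDIA/radtts README for the authoritative commands.

# 1. Train the RADTTS decoder (and alignment module) from a JSON config
python train.py -c config.json -p train_config.output_directory=outdir

# 2. Optionally train the attribute predictors (duration, F0, energy)
python train.py -c config_dap.json -p train_config.output_directory=outdir_dap

# 3. Synthesize: RADTTS checkpoint + HiFi-GAN vocoder checkpoint/config + input text
python inference.py -c config.json -r radtts_model.pt \
    -v hifigan_model.pt -k hifigan_config.json \
    -t test_sentences.txt -o results/
```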
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README notes that additional pre-trained RADTTS models are coming soon, so the current selection of released checkpoints may be limited. Specific hardware requirements (e.g., GPU memory, CUDA version) are not explicitly documented, and training flow-based models of this size typically demands significant GPU resources.