Tacotron 2: PyTorch implementation for text-to-speech synthesis
This PyTorch implementation of Tacotron 2 follows the paper "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions". It is aimed at researchers and developers working on text-to-speech synthesis, offering faster-than-real-time inference and leveraging NVIDIA's Apex and AMP for distributed and mixed-precision training.
How It Works
The system generates mel spectrograms from input text with a sequence-to-sequence model; a vocoder such as WaveGlow then synthesizes audio from those spectrograms. This decouples acoustic modeling from vocoding, allowing each stage to be optimized independently and enabling faster inference. Automatic Mixed Precision (AMP) and distributed training significantly speed up training on multi-GPU setups.
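A minimal sketch of this two-stage pipeline, loosely following the repo's inference.ipynb; the checkpoint filenames are placeholders, and the module names (hparams, train, text) assume the repo's layout at the time of writing:

```python
import numpy as np
import torch

from hparams import create_hparams   # repo module: default hyperparameters
from train import load_model         # repo helper: builds a Tacotron2 model from hparams
from text import text_to_sequence    # repo module: maps text to symbol IDs

# Stage 1: text -> mel spectrogram (Tacotron 2; placeholder checkpoint path)
hparams = create_hparams()
model = load_model(hparams)
model.load_state_dict(torch.load("tacotron2_statedict.pt")["state_dict"])
model.cuda().eval().half()

sequence = np.array(text_to_sequence("Hello, world!", ["english_cleaners"]))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)

# Stage 2: mel spectrogram -> waveform (WaveGlow vocoder; placeholder checkpoint path)
waveglow = torch.load("waveglow_256channels.pt")["model"]
waveglow.cuda().eval().half()
with torch.no_grad():
    audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)
```

Because the decoder emits mel frames rather than raw samples, the vocoder can be swapped or retrained without touching the acoustic model, which is what makes the faster-than-real-time inference claim achievable.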
Quick Start & Requirements
```bash
pip install -r requirements.txt
```
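After cloning the repo (including its WaveGlow submodule), installing PyTorch 1.0 and Apex, and running the command above, training is launched via train.py. A sketch per the upstream README; the output and log directory names are illustrative:

```bash
# Single-GPU training
python train.py --output_directory=outdir --log_directory=logdir

# Multi-GPU, mixed-precision training (distributed run + AMP)
python -m multiproc train.py --output_directory=outdir --log_directory=logdir \
       --hparams=distributed_run=True,fp16_run=True
```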
Maintenance & Community
This project is maintained by NVIDIA. Related repositories include WaveGlow and nv-wavenet. At the time of writing, the repository is inactive, with the last update about a year ago.
Licensing & Compatibility
The repository is released under the permissive BSD-3-Clause license, which allows commercial use and integration with closed-source projects.
Limitations & Caveats
The implementation targets specific versions of PyTorch (1.0) and NVIDIA Apex, which may require careful environment management. The README also cautions that, when performing mel-spectrogram-to-audio synthesis, Tacotron 2 and the mel decoder (vocoder) must have been trained on the same mel-spectrogram representation; mismatched representations degrade output quality.
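To make the representation constraint concrete, here is a sketch using the repo's TacotronSTFT from layers.py; the parameter values shown match the defaults in hparams.py at the time of writing, but treat them as assumptions to verify against your own checkpoints:

```python
import torch

from layers import TacotronSTFT  # repo module: STFT + mel filterbank wrapper

# These settings define the mel-spectrogram representation. The vocoder
# (e.g., WaveGlow) must have been trained with the same values, or
# mel-to-audio synthesis quality will degrade.
stft = TacotronSTFT(
    filter_length=1024,   # FFT size
    hop_length=256,       # frame shift in samples
    win_length=1024,      # analysis window length
    n_mel_channels=80,    # number of mel bands the decoder predicts
    sampling_rate=22050,  # audio sample rate in Hz
    mel_fmin=0.0,         # lowest mel filterbank frequency
    mel_fmax=8000.0,      # highest mel filterbank frequency
)

# wav: FloatTensor of shape (1, num_samples) with values in [-1, 1]
wav = torch.zeros(1, 22050)
mel = stft.mel_spectrogram(wav)  # shape: (1, 80, num_frames)
```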