PyTorch vocoder for real-time speech synthesis, based on Parallel WaveGAN
This repository provides unofficial PyTorch implementations of state-of-the-art non-autoregressive neural vocoders, including Parallel WaveGAN, MelGAN, Multi-band MelGAN, HiFi-GAN, and StyleMelGAN. It aims to enable real-time neural vocoding for text-to-speech and singing voice synthesis, offering compatibility with ESPnet-TTS and other Tacotron2-based implementations.
How It Works
The project implements various GAN-based vocoder architectures that generate audio waveforms from mel-spectrograms. These models leverage techniques like multi-band processing and adversarial training to achieve high-fidelity audio synthesis at fast inference speeds. The non-autoregressive nature of these models is key to their real-time performance.
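The non-autoregressive idea above can be sketched in a toy PyTorch model: a convolutional generator upsamples a mel-spectrogram to a waveform in a single forward pass, with no sample-by-sample loop. This is an illustrative sketch only; the layer sizes and structure are assumptions, not the actual Parallel WaveGAN architecture.

```python
import torch
import torch.nn as nn

class ToyVocoder(nn.Module):
    """Illustrative non-autoregressive vocoder sketch (not the real model)."""

    def __init__(self, mel_bins=80, hop_size=256):
        super().__init__()
        # Two transposed convolutions whose strides multiply to hop_size
        # (16 * 16 = 256), so each mel frame expands to hop_size samples.
        self.net = nn.Sequential(
            nn.ConvTranspose1d(mel_bins, 64, kernel_size=32, stride=16, padding=8),
            nn.ReLU(),
            nn.ConvTranspose1d(64, 1, kernel_size=32, stride=16, padding=8),
            nn.Tanh(),  # keep the waveform in [-1, 1]
        )

    def forward(self, mel):
        # mel: (batch, mel_bins, frames) -> waveform: (batch, frames * hop_size)
        return self.net(mel).squeeze(1)

mel = torch.randn(1, 80, 100)   # 100 mel frames of random features
wav = ToyVocoder()(mel)
print(wav.shape)                # torch.Size([1, 25600]), i.e. 100 * 256 samples
```

Because every output sample is produced in parallel rather than conditioned on previous samples, inference cost is one forward pass, which is what makes real-time synthesis feasible.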
Quick Start & Requirements
Install with `pip install -e .` (after `git clone`), or via `make` in the `tools` directory. System dependencies: `libsndfile`, `jq`, and `sox`. Tested with PyTorch 1.8.1 to 2.1.0.
Maintenance & Community
The repository is maintained by Tomoki Hayashi (@kan-bayashi). Updates include new recipes and support for singing voice vocoders.
Licensing & Compatibility
The license of pre-trained models depends on the corpus used for training. Some codes are derived from ESPnet/Kaldi (Apache-2.0). Users must verify dataset licenses for commercial use.
Limitations & Caveats
These implementations are unofficial. The terms of use of each pre-trained model follow those of its training corpus, so users are responsible for verifying dataset licenses before commercial use to avoid potential legal disputes.