ParallelWaveGAN by kan-bayashi

PyTorch vocoder for real-time speech synthesis, based on Parallel WaveGAN

created 5 years ago
1,612 stars

Top 26.6% on sourcepulse

Project Summary

This repository provides unofficial PyTorch implementations of state-of-the-art non-autoregressive neural vocoders, including Parallel WaveGAN, MelGAN, Multi-band MelGAN, HiFi-GAN, and StyleMelGAN. It aims to enable real-time neural vocoding for text-to-speech and singing voice synthesis, offering compatibility with ESPnet-TTS and other Tacotron2-based implementations.

How It Works

The project implements various GAN-based vocoder architectures that generate audio waveforms from mel-spectrograms. These models leverage techniques like multi-band processing and adversarial training to achieve high-fidelity audio synthesis at fast inference speeds. The non-autoregressive nature of these models is key to their real-time performance.
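
The relationship between mel-spectrogram frames and waveform samples can be sketched in a few lines. This is an illustrative calculation, not code from the repository: the function names and the upsampling scales `[4, 4, 4, 4]` (a hop size of 256 samples per frame) are assumptions chosen for the example. It also contrasts the number of sequential generation steps an autoregressive vocoder would need with the single forward pass of a non-autoregressive one.

```python
from math import prod


def waveform_length(num_frames, upsample_scales):
    """Number of samples produced when every mel frame is upsampled
    by each scale in turn (the product of the scales is the hop size)."""
    hop_size = prod(upsample_scales)
    return num_frames * hop_size


def sequential_steps(num_samples, autoregressive):
    """Sequential generation steps: one per sample if autoregressive,
    otherwise a single forward pass over the whole utterance."""
    return num_samples if autoregressive else 1


# Assumed example values, not taken from any specific recipe.
scales = [4, 4, 4, 4]   # hop size = 4 * 4 * 4 * 4 = 256
frames = 100            # length of the input mel-spectrogram

samples = waveform_length(frames, scales)
print("samples:", samples)
print("autoregressive steps:", sequential_steps(samples, True))
print("non-autoregressive steps:", sequential_steps(samples, False))
```

Because the sequential step count no longer grows with the number of samples, the whole waveform can be computed in parallel on a GPU, which is what makes real-time synthesis practical.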

Quick Start & Requirements

  • Install: pip install -e . (after git clone) or via make in the tools directory.
  • Prerequisites: Python 3.8+, CUDA 11.0+, cuDNN 8+, NCCL 2+, libsndfile, jq, sox. Tested with PyTorch 1.8.1 to 2.1.0.
  • Setup: Installation via pip is straightforward. Training recipes are provided, similar to ESPnet.
  • Docs: ESPnet2 Demo, ESPnet1 Demo, Muskits Demo
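
Before installing, it may help to confirm that some of the prerequisites above are present. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical and not part of the repository, and it checks only the Python version and the `jq` and `sox` command-line tools; CUDA, cuDNN, NCCL, and libsndfile are not checked here.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)             # minimum version listed in the prerequisites
REQUIRED_TOOLS = ("jq", "sox")  # command-line tools listed in the prerequisites


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Missing tools:", missing_tools() or "none")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.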

Highlighted Details

  • Supports multiple languages (English, Japanese, Mandarin, Korean) and singing voice synthesis.
  • Achieves very fast inference, with a real-time factor (RTF, synthesis time divided by audio duration) as low as 0.001 on GPU.
  • Offers numerous pre-trained models for various datasets and architectures.
  • Provides detailed recipes and examples for integration with ESPnet-TTS.
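
To make the RTF figure above concrete: the real-time factor is the time spent synthesizing divided by the duration of the audio produced, so values below 1 are faster than real time. The sketch below shows the arithmetic and a simple timing wrapper. The function names, the 24 kHz sample rate, and the `synthesize` callable are assumptions for illustration; they are not the repository's API.

```python
import time


def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """RTF = synthesis time / duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, features, sample_rate):
    """Time a hypothetical synthesize(features) call that returns a
    sequence of waveform samples, then compute its RTF."""
    start = time.perf_counter()
    waveform = synthesize(features)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform), sample_rate)


# Assumed example: 10 s of audio at 24 kHz generated in 10 ms.
print(real_time_factor(0.01, 240_000, 24_000))
```

At an RTF of 0.001, one second of audio takes about one millisecond to generate. When timing a real GPU model, the device should be synchronized before reading the clock, otherwise the measured time can be too small.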

Maintenance & Community

The repository is maintained by Tomoki Hayashi (@kan-bayashi). Updates include new recipes and support for singing voice vocoders.

Licensing & Compatibility

The license of each pre-trained model depends on the corpus used to train it. Some of the code is derived from ESPnet and Kaldi (Apache-2.0). Users must verify dataset licenses before commercial use.

Limitations & Caveats

This is an unofficial implementation. The README notes that the terms of use of each pre-trained model follow those of its training corpus, so users are responsible for checking dataset licenses before commercial use and for any legal disputes that result.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 15 stars in the last 90 days
