ParallelWaveGAN by kan-bayashi

PyTorch vocoder for real-time speech synthesis, based on Parallel WaveGAN

Created 5 years ago
1,620 stars

Top 25.9% on SourcePulse

Project Summary

This repository provides unofficial PyTorch implementations of state-of-the-art non-autoregressive neural vocoders, including Parallel WaveGAN, MelGAN, Multi-band MelGAN, HiFi-GAN, and StyleMelGAN. It aims to enable real-time neural vocoding for text-to-speech and singing voice synthesis, offering compatibility with ESPnet-TTS and other Tacotron2-based implementations.

How It Works

The project implements several GAN-based vocoder architectures that generate audio waveforms from mel-spectrograms. These models use techniques such as multi-band processing and adversarial training to achieve high-fidelity synthesis at fast inference speeds. Because the models are non-autoregressive, they generate all output samples in parallel rather than one at a time, which is key to their real-time performance.
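As a rough illustration of the mel-to-waveform mapping, the sketch below shows only the length arithmetic of a generator that upsamples each mel frame to a fixed number of audio samples. It is plain Python, not code from this repository, and the hop size of 256 and the per-stage factors (8, 8, 2, 2) are assumed values chosen for illustration.

```python
from math import prod
from typing import Sequence


def output_samples(num_frames: int, upsample_scales: Sequence[int]) -> int:
    """Number of waveform samples produced for `num_frames` mel frames.

    Each upsampling stage multiplies the time resolution by its factor,
    so the total factor is the product of all stages.
    """
    return num_frames * prod(upsample_scales)


# Hypothetical values for illustration only (not taken from the repository).
HOP_SIZE = 256                   # audio samples per mel frame (assumed)
UPSAMPLE_SCALES = [8, 8, 2, 2]   # per-stage upsampling factors (assumed)

# The stages must multiply to the hop size so one frame maps to one hop.
assert prod(UPSAMPLE_SCALES) == HOP_SIZE

print(output_samples(100, UPSAMPLE_SCALES))  # 100 frames -> 25600 samples
```

The output length depends only on the number of input frames, so every sample can be computed in a single forward pass with no sample-by-sample loop.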

Quick Start & Requirements

  • Install: clone the repository with git clone, then run pip install -e . from its root, or run make in the tools directory.
  • Prerequisites: Python 3.8+, CUDA 11.0+, cuDNN 8+, NCCL 2+, libsndfile, jq, sox. Tested with PyTorch 1.8.1 to 2.1.0.
  • Setup: Installation via pip is straightforward. Training recipes are provided in a style similar to ESPnet's.
  • Docs: ESPnet2 Demo, ESPnet1 Demo, Muskits Demo
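Before installing, it can help to confirm that some of the prerequisites are present. The following is a minimal preflight sketch using only the Python standard library; the function name check_environment is hypothetical and not part of the repository. It checks the Python version and whether jq and sox are on PATH, and does not attempt to verify CUDA, cuDNN, NCCL, or libsndfile.

```python
import shutil
import sys

# Command-line tools named in the prerequisites list above.
REQUIRED_TOOLS = ("jq", "sox")
MIN_PYTHON = (3, 8)


def check_environment(tools=REQUIRED_TOOLS, min_python=MIN_PYTHON):
    """Return a dict mapping each requirement name to True or False."""
    report = {"python>=%d.%d" % min_python: sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```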

Highlighted Details

  • Supports multiple languages (English, Japanese, Mandarin, Korean) and singing voice synthesis.
  • Achieves fast inference, with a real-time factor (RTF) as low as 0.001 on GPU.
  • Offers numerous pre-trained models for various datasets and architectures.
  • Provides detailed recipes and examples for integration with ESPnet-TTS.
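RTF is the time spent synthesizing divided by the duration of the audio produced, so an RTF of 0.001 means one second of audio takes about one millisecond to generate. The sketch below shows the calculation; dummy_vocoder is a hypothetical stand-in that returns silence, and the 24 kHz sample rate is an assumed value. When timing a real model on a GPU, the device would also need to be synchronized before reading the clock.

```python
import time


def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = processing time / duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def dummy_vocoder(num_samples: int) -> list:
    """Hypothetical stand-in for a real vocoder: returns silence."""
    return [0.0] * num_samples


SAMPLE_RATE = 24000  # assumed sample rate in Hz, for illustration

start = time.perf_counter()
wav = dummy_vocoder(SAMPLE_RATE * 2)  # "synthesize" two seconds of audio
elapsed = time.perf_counter() - start

print(f"RTF = {real_time_factor(elapsed, len(wav), SAMPLE_RATE):.6f}")
```

An RTF below 1.0 means synthesis runs faster than playback.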

Maintenance & Community

The repository is maintained by Tomoki Hayashi (@kan-bayashi). Updates include new recipes and support for singing voice vocoders.

Licensing & Compatibility

The license of each pre-trained model depends on the corpus used for training. Some of the code is derived from ESPnet and Kaldi (Apache-2.0). Users must verify dataset licenses before commercial use.

Limitations & Caveats

The implementations are unofficial. Users are responsible for checking dataset licenses before commercial use and for any legal disputes that may arise. The README notes that the terms of use of the pre-trained models follow those of the respective training corpora.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

4 stars in the last 30 days
