melgan by seungwonpark

PyTorch implementation of MelGAN vocoder

created 5 years ago
646 stars

Top 52.5% on sourcepulse

Project Summary

This repository is a PyTorch implementation of MelGAN, a fast, lightweight neural vocoder that converts mel-spectrograms into high-fidelity audio waveforms. The authors describe it as lighter and faster than models like WaveGlow, and better at generalizing to unseen speakers. Its mel-spectrogram function matches the one in NVIDIA's Tacotron2, so Tacotron2's mel-spectrogram outputs can be converted to audio directly.

How It Works

MelGAN is a Generative Adversarial Network (GAN). A generator network synthesizes audio from mel-spectrograms, and a discriminator network learns to distinguish real audio from generated audio. Its key advantage is an efficient, non-autoregressive generator: it produces the whole waveform in a single forward pass rather than one sample at a time, which gives faster inference and lower computational cost than autoregressive models while maintaining high audio quality.
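
As a rough illustration of why a feed-forward generator can emit an entire waveform at once, the sketch below does only the shape arithmetic. The upsampling factors (8, 8, 2, 2) are an assumption taken from the MelGAN paper rather than from this summary, and the function names are hypothetical; none of this is the repository's code.

```python
SAMPLE_RATE = 22050  # Hz, the dataset sample rate stated in this summary
UPSAMPLE_FACTORS = (8, 8, 2, 2)  # assumption: per-stage factors from the MelGAN paper


def total_upsampling(factors=UPSAMPLE_FACTORS):
    """Multiply the per-stage factors to get samples produced per mel frame."""
    total = 1
    for factor in factors:
        total *= factor
    return total


def samples_for_frames(num_frames, factors=UPSAMPLE_FACTORS):
    """Waveform length, in samples, for a mel-spectrogram with num_frames frames."""
    return num_frames * total_upsampling(factors)


def duration_seconds(num_frames, sample_rate=SAMPLE_RATE, factors=UPSAMPLE_FACTORS):
    """Audio duration implied by num_frames mel frames."""
    return samples_for_frames(num_frames, factors) / sample_rate
```

Under these assumptions, 100 mel frames map to 25,600 samples, or about 1.16 seconds of audio at 22050 Hz.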

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.6, bash. Pretrained models are available via PyTorch Hub.
  • Dataset: Requires WAV files with a 22050 Hz sample rate. Preprocessing is handled by preprocess.py.
  • Training: python trainer.py -c [config yaml file] -n [name of the run]
  • Inference: python inference.py -p [checkpoint path] -i [input mel path]
  • Resources: Training was conducted on a V100 GPU for 14 days on the LJSpeech-1.1 dataset.
  • Docs: http://swpark.me/melgan/
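
Since the dataset must be sampled at 22050 Hz, it can help to check the files before preprocessing. The following is a minimal sketch using only the Python standard library; `find_mismatched_wavs` is a hypothetical helper, not part of the repository, and the `wave` module reads only uncompressed PCM WAV files.

```python
import wave
from pathlib import Path

EXPECTED_SAMPLE_RATE = 22050  # Hz, as stated in the dataset requirement


def find_mismatched_wavs(root, expected_rate=EXPECTED_SAMPLE_RATE):
    """Return (path, rate) pairs for WAV files under root with a different sample rate.

    Hypothetical helper for illustration. Files that are not PCM WAV
    will make wave.open raise wave.Error.
    """
    mismatched = []
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            mismatched.append((path, rate))
    return mismatched
```

Calling `find_mismatched_wavs("path/to/dataset")` returns an empty list when every file matches; any file it reports would need resampling before being passed to preprocess.py.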

Highlighted Details

  • PyTorch Hub integration for easy access to pretrained models.
  • Identical mel-spectrogram function to NVIDIA/tacotron2 for seamless integration.
  • Claims to be lighter, faster, and better at generalizing to unseen speakers than WaveGlow.
  • The repository reports improved performance after replacing average pooling with max pooling and reflection padding with replication padding.

Maintenance & Community

  • Implemented by Seungwon Park, Myunchul Joe, and Rishikesh.
  • Links to audio samples are provided.

Licensing & Compatibility

  • BSD 3-Clause License.
  • Code snippets from NVIDIA/waveglow and HarryVolek/PyTorch_Speaker_Verification are included, with the latter having an unspecified license.

Limitations & Caveats

The repository notes that the utils/hparams.py file is sourced from a project with an unspecified license, which may pose compatibility issues for commercial use. A Google Colab demo is marked as "TODO".

Health Check

  • Last commit: 4 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 90 days
