melgan by seungwonpark

PyTorch implementation of MelGAN vocoder

Created 6 years ago

650 stars

Top 51.4% on SourcePulse

1 Expert Loves This Project

Casper Hansen

Author of AutoAWQ

Project Summary

This repository is a PyTorch implementation of MelGAN, a fast, lightweight neural vocoder that converts mel-spectrograms into audio waveforms. The authors describe it as lighter and faster than WaveGlow, and better at generalizing to unseen speakers. It uses the same mel-spectrogram function as NVIDIA's Tacotron2, so Tacotron2 mel-spectrogram outputs can be converted to audio directly.

How It Works

MelGAN is a Generative Adversarial Network (GAN). A generator network synthesizes audio from mel-spectrograms, and a discriminator network learns to distinguish real audio from generated audio. The generator is non-autoregressive, so it produces all output samples in parallel rather than one at a time. This gives faster inference and lower computational cost than autoregressive vocoders while maintaining high audio quality.
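As a rough illustration of what the generator has to do, the sketch below works out how many waveform samples correspond to a given number of mel frames. It assumes the upsampling factors reported in the MelGAN paper (8, 8, 2, 2, for a total of 256) and the 22050 Hz sample rate mentioned under Quick Start. The names UPSAMPLE_FACTORS, output_samples, and output_seconds are illustrative and are not taken from this repository.

```python
# Assumed values, not stated in this summary: the MelGAN paper's generator
# upsamples by 8, 8, 2, 2, which matches a mel hop length of 256 samples.
UPSAMPLE_FACTORS = (8, 8, 2, 2)
SAMPLE_RATE = 22050  # Hz, the rate the dataset requirement asks for


def total_upsampling(factors=UPSAMPLE_FACTORS):
    """Multiply the per-stage factors to get samples produced per mel frame."""
    total = 1
    for factor in factors:
        total *= factor
    return total


def output_samples(n_mel_frames):
    """Number of waveform samples generated for n_mel_frames mel frames."""
    return n_mel_frames * total_upsampling()


def output_seconds(n_mel_frames):
    """Duration in seconds of the generated audio at SAMPLE_RATE."""
    return output_samples(n_mel_frames) / SAMPLE_RATE


if __name__ == "__main__":
    frames = 100
    print(output_samples(frames), "samples")
    print(round(output_seconds(frames), 3), "seconds")
```

Under these assumptions every mel frame expands to 256 samples in a single forward pass, which is where the speed advantage over sample-by-sample generation comes from.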

Quick Start & Requirements

Install: pip install -r requirements.txt
Prerequisites: Python 3.6, bash. Pretrained models are available via PyTorch Hub.
Dataset: WAV files sampled at 22050 Hz. Preprocessing is handled by preprocess.py.
Training: python trainer.py -c [config yaml file] -n [name of the run]
Inference: python inference.py -p [checkpoint path] -i [input mel path]
Resources: Training was conducted on a V100 GPU for 14 days on the LJSpeech-1.1 dataset.
Docs: http://swpark.me/melgan/
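Since the dataset must be sampled at 22050 Hz, it can help to check the files before running preprocess.py. The following is a minimal sketch using only the Python standard library. The helper find_wrong_rate is hypothetical and not part of the repository, and the wave module only reads uncompressed PCM WAV files.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 22050  # Hz, as required by the dataset note above


def find_wrong_rate(wav_dir, expected_rate=EXPECTED_RATE):
    """Return (path, rate) pairs for WAV files whose sample rate differs.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    mismatches = []
    # Walk the directory tree so nested dataset layouts are covered too.
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            mismatches.append((path, rate))
    return mismatches


if __name__ == "__main__":
    import sys

    # Usage: python check_rate.py [dataset root]
    for path, rate in find_wrong_rate(sys.argv[1]):
        print(f"{path}: {rate} Hz (expected {EXPECTED_RATE} Hz)")
```

Files reported by this check would need to be resampled before preprocessing.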

Highlighted Details

PyTorch Hub integration for easy access to pretrained models.
Identical mel-spectrogram function to NVIDIA/tacotron2 for seamless integration.
Claims to be lighter, faster, and better at generalizing to unseen speakers than WaveGlow.
Reports improved performance from replacing average pooling with max pooling and reflection padding with replication padding.

Maintenance & Community

Implemented by Seungwon Park, Myunchul Joe, and Rishikesh.
Links to audio samples are provided.

Licensing & Compatibility

BSD 3-Clause License.
Code snippets from NVIDIA/waveglow and HarryVolek/PyTorch_Speaker_Verification are included, with the latter having an unspecified license.

Limitations & Caveats

The repository notes that utils/hparams.py comes from a project with no specified license, which may complicate commercial use. The Google Colab demo is still marked "TODO".

Health Check

Last Commit: 5 years ago
Responsiveness: Inactive

Star History

0 stars in the last 30 days