VocGAN by rishikksh20

Real-time vocoder using a hierarchically-nested adversarial network

Created 5 years ago

321 stars

Top 84.7% on SourcePulse

Project Summary

This repository offers a modified implementation of VocGAN, a neural vocoder designed for high-fidelity, real-time audio synthesis. It targets researchers and developers working on text-to-speech (TTS) and voice cloning systems, and aims to train faster than the original VocGAN by using a more efficient discriminator.

How It Works

The core modification involves replacing VocGAN's original hierarchically-nested discriminator with the discriminator from Full-Band MelGAN. This change is motivated by research indicating that the MelGAN discriminator offers significantly faster training times while still being capable of training a generator to produce high-fidelity audio. The generator architecture itself has also been slightly modified.
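
To make the multi-scale idea concrete, here is a minimal pure-Python sketch of how a MelGAN-style discriminator is usually fed: the raw waveform plus progressively downsampled copies, one per sub-discriminator. It assumes the setup described in the MelGAN paper (three scales, average pooling with kernel 4 and stride 2); this repository's actual code is not shown here and may differ. The names `avg_pool` and `multi_scale_inputs` are hypothetical, and only input preparation is shown, not the discriminator networks themselves.

```python
def avg_pool(signal, kernel=4, stride=2, padding=1):
    """Strided average pooling over a 1-D list of samples.

    Padded positions are excluded from each window's average, so a
    constant signal stays constant after pooling.
    """
    n = len(signal)
    pooled = []
    start = -padding
    while start + kernel <= n + padding:
        window = [signal[i] for i in range(start, start + kernel) if 0 <= i < n]
        pooled.append(sum(window) / len(window))
        start += stride
    return pooled


def multi_scale_inputs(signal, num_scales=3):
    """Return the waveform at each scale: original, then repeatedly pooled."""
    scales = [list(signal)]
    for _ in range(num_scales - 1):
        scales.append(avg_pool(scales[-1]))
    return scales


if __name__ == "__main__":
    waveform = [float(i % 4) for i in range(16)]  # toy stand-in for audio samples
    for level, scaled in enumerate(multi_scale_inputs(waveform)):
        print(f"scale {level}: {len(scaled)} samples")
```

Each scale halves the number of samples, so the sub-discriminators working on pooled audio see a coarser view of the waveform at lower cost.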

Quick Start & Requirements

Install dependencies: pip install -r requirements.txt
Preprocessing: python preprocess.py -c config/default.yaml -d [data's root path]
Training: python trainer.py -c [config yaml file] -n [name of the run]
Prerequisites: Python 3.6 and PyTorch. Input audio must have a sample rate of 22050 Hz.
Pretrained models for LJSpeech, KSS, and VCTK datasets are available for download.
Official documentation and audio samples can be found via links in the README.
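
Since the dataset must be sampled at 22050 Hz, it can help to check the audio files before running preprocessing. The sketch below is not part of the repository: `find_mismatched_wavs` is a hypothetical helper that uses only Python's standard `wave` module, which reads uncompressed PCM WAV files, and it assumes the dataset is stored as `.wav` files under a single root directory.

```python
import wave
from pathlib import Path

EXPECTED_SAMPLE_RATE = 22050  # sample rate required by the prerequisites above


def find_mismatched_wavs(root, expected=EXPECTED_SAMPLE_RATE):
    """Return (path, sample_rate) pairs for WAV files not at the expected rate."""
    mismatched = []
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected:
            mismatched.append((path, rate))
    return mismatched


if __name__ == "__main__":
    import sys

    # Usage: python check_rates.py <data root path>
    for path, rate in find_mismatched_wavs(sys.argv[1]):
        print(f"{path}: {rate} Hz (expected {EXPECTED_SAMPLE_RATE} Hz)")
```

Any files reported here would need to be resampled to 22050 Hz with an audio tool of your choice before running preprocess.py.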

Highlighted Details

Modified VocGAN for faster training, achieving real-time audio generation.
MelGAN discriminator trains significantly faster: 7.2 it/sec versus 2.8 sec/it (about 0.36 it/sec) for VocGAN's discriminator on a P100 with batch size 16, roughly a 20x difference.
Pretrained models available for English (LJSpeech, VCTK) and Korean (KSS) datasets.
Audio quality is comparable to MelGAN at 300 epochs, though the original VocGAN recommends training for up to 3000 epochs.
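
The two training speeds quoted above use inverse units (iterations per second versus seconds per iteration), which makes them hard to compare at a glance. The short sketch below converts both to iterations per second and computes the ratio. The figures come from the list above; the helper name `iterations_per_second` is hypothetical.

```python
def iterations_per_second(value, unit):
    """Normalize a training-speed figure to iterations per second."""
    if unit == "it/sec":
        return value
    if unit == "sec/it":
        return 1.0 / value
    raise ValueError(f"unknown unit: {unit}")


# Figures quoted above (P100, batch size 16).
melgan_speed = iterations_per_second(7.2, "it/sec")
vocgan_speed = iterations_per_second(2.8, "sec/it")

speedup = melgan_speed / vocgan_speed

if __name__ == "__main__":
    print(f"VocGAN discriminator: {vocgan_speed:.2f} it/sec")
    print(f"MelGAN discriminator: {melgan_speed:.2f} it/sec")
    print(f"Speedup: about {speedup:.0f}x")
```

On these numbers the gap is about a factor of 20, which is why the 3000-epoch schedule is practical only with the faster discriminator.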

Maintenance & Community

The author is open to suggestions and modifications. The README points to Deepsync Technologies for a more comprehensive TTS toolbox.

Licensing & Compatibility

The repository does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

This is a modified version of VocGAN; for the original implementation, refer to the baseline branch. The author notes that optimizing the baseline VocGAN discriminator by downsampling audio during preprocessing might further speed up training. Training the original VocGAN to the recommended 3000 epochs is currently infeasible due to slow training speeds.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive

Star History

0 stars in the last 30 days