Real-time vocoder using a hierarchically-nested adversarial network
This repository provides a modified implementation of VocGAN, a neural vocoder designed for high-fidelity, real-time audio synthesis. It targets researchers and developers working on text-to-speech (TTS) and voice cloning systems, and aims to train faster than the original VocGAN by using a more efficient discriminator.
How It Works
The core modification replaces VocGAN's original hierarchically-nested discriminator with the discriminator from Full-Band MelGAN. The motivation is that the MelGAN discriminator trains significantly faster while still being able to train a generator that produces high-fidelity audio. The generator architecture has also been slightly modified.
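As a rough illustration of why a MelGAN-style discriminator can be cheap, the sketch below shows how a multi-scale discriminator might derive its inputs: each scale sees the waveform average-pooled once more, so later scales process far fewer samples. This is a minimal pure-Python sketch and not the repository's code. The function names, the three scales, and the pooling settings (kernel 4, stride 2, padding 1, with padded zeros counted in the mean) are assumptions chosen for illustration.

```python
def avg_pool1d(signal, kernel_size=4, stride=2, padding=1):
    """Average-pool a 1-D sequence with zero padding.

    Assumption: padded zeros are counted in the mean. The repository's
    discriminator may use different pooling settings.
    """
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    pooled = []
    for start in range(0, len(padded) - kernel_size + 1, stride):
        window = padded[start:start + kernel_size]
        pooled.append(sum(window) / kernel_size)
    return pooled


def multi_scale_inputs(waveform, num_scales=3):
    """Return the waveform at each discriminator scale (hypothetical helper).

    Scale 0 is the raw waveform; each later scale is the previous one
    pooled again, roughly halving its length.
    """
    scales = [list(waveform)]
    for _ in range(num_scales - 1):
        scales.append(avg_pool1d(scales[-1]))
    return scales


if __name__ == "__main__":
    toy_waveform = [1.0] * 16
    for index, scale in enumerate(multi_scale_inputs(toy_waveform)):
        print(f"scale {index}: {len(scale)} samples")
```

In a real model, each scale would be fed to its own stack of strided convolutions. The sketch only covers the input side, where the reduction in sequence length comes from.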
Quick Start & Requirements
pip install -r requirements.txt
python preprocess.py -c config/default.yaml -d [data's root path]
python trainer.py -c [config yaml file] -n [name of the run]
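The three commands above can also be chained from one script. The sketch below is a hypothetical wrapper, not part of the repository. It assumes it is run from the repository root, and `build_commands` and `run_pipeline` are illustrative names. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys


def build_commands(config, data_root, run_name):
    """Build the install, preprocess, and train commands as argument lists.

    The script names and the -c/-d/-n flags come from the quick start
    above; the paths passed in are supplied by the caller.
    """
    python = sys.executable  # reuse the interpreter running this script
    return [
        [python, "-m", "pip", "install", "-r", "requirements.txt"],
        [python, "preprocess.py", "-c", config, "-d", data_root],
        [python, "trainer.py", "-c", config, "-n", run_name],
    ]


def run_pipeline(config, data_root, run_name, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for command in build_commands(config, data_root, run_name):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder data path and run name; replace them with your own.
    run_pipeline("config/default.yaml", "path/to/data", "my_run", dry_run=True)
```

Passing `dry_run=False` runs the steps in sequence and raises `subprocess.CalledProcessError` if any step exits with a non-zero status.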
Maintenance & Community
The author is open to suggestions and modifications. For a more comprehensive TTS toolbox, the README points to Deepsync Technologies.
Licensing & Compatibility
The repository does not explicitly state a license, so the terms for commercial use or closed-source linking are unspecified.
Limitations & Caveats
This is a modified version of VocGAN; for the original implementation, refer to the baseline branch. The author notes that optimizing the baseline VocGAN discriminator by downsampling audio during preprocessing might further speed up training. Training the original VocGAN to the recommended 3000 epochs is currently infeasible due to slow training speeds.