PyTorch implementation for text-to-speech using denoising diffusion GANs
This repository provides a PyTorch implementation of DiffGAN-TTS, a text-to-speech system that combines Denoising Diffusion Probabilistic Models with Generative Adversarial Networks for high-fidelity and efficient audio synthesis. It is targeted at researchers and developers working on advanced TTS systems, offering controllable speech generation and supporting both single-speaker and multi-speaker scenarios.
How It Works
DiffGAN-TTS employs a two-stage approach. The first stage trains a FastSpeech2-based model, including an auxiliary (Mel) decoder. The second stage uses a shallow diffusion mechanism, built on top of the auxiliary decoder's output, to generate high-quality speech. This hybrid approach combines the fast sampling of GANs with the high-fidelity generation of diffusion models.
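The denoising side of this idea can be sketched as a short reverse-diffusion loop in which a GAN generator predicts the clean sample at each step. The step count, noise schedule, and placeholder generator below are illustrative assumptions, not the repository's actual code:

```python
import math
import random

T = 4  # denoising-diffusion GANs use very few steps (e.g. T = 4)
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # linear schedule
alphas = [1.0 - b for b in betas]
alpha_bars, prod = [], 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)

def generator(x_t, t):
    """Placeholder for the GAN generator, which predicts the clean
    sample x0 directly from the noisy input x_t at step t."""
    return x_t / math.sqrt(alpha_bars[t])

def q_posterior_sample(x0_hat, x_t, t, noise):
    """Sample x_{t-1} from the standard diffusion posterior
    q(x_{t-1} | x_t, x0_hat)."""
    if t == 0:
        return x0_hat
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    coef0 = math.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
    coef_t = math.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    var = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t]
    return coef0 * x0_hat + coef_t * x_t + math.sqrt(var) * noise

# Reverse process: start from pure noise, denoise in T steps.
x = random.gauss(0.0, 1.0)
for t in reversed(range(T)):
    x = q_posterior_sample(generator(x, t), x, t, random.gauss(0.0, 1.0))
```

Predicting x0 directly (rather than the noise) is what lets an adversarially trained generator take large denoising steps, which is why so few steps suffice.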
Quick Start & Requirements
# Install dependencies
pip3 install -r requirements.txt

# Preprocess the dataset
python3 preprocess.py --dataset DATASET

# Train a model
python3 train.py --model MODEL --dataset DATASET

# Synthesize speech from a trained checkpoint
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --dataset DATASET
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The implementation uses VCTK instead of the original paper's Mandarin Chinese dataset and lowers the sample rate to 22050 Hz. The controllability features come from FastSpeech2 rather than being a core innovation of DiffGAN-TTS itself. The README also warns that lambda_fm can cause the model to explode if it is not set carefully.
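A common way to stabilize a feature-matching term is to scale its weight relative to the adversarial loss and cap it. The sketch below is a generic stabilization pattern, not this repository's implementation; every name and value in it is hypothetical:

```python
def scaled_fm_weight(loss_adv, loss_fm, base=1.0, eps=1e-8, cap=10.0):
    """Hypothetical dynamic weight for the feature-matching loss:
    keep lambda_fm * loss_fm on the same scale as the adversarial
    loss, and cap it so the total generator loss cannot blow up."""
    return min(base * loss_adv / (loss_fm + eps), cap)

# Toy scalars standing in for real discriminator outputs.
loss_adv, loss_fm = 0.5, 2.0
lambda_fm = scaled_fm_weight(loss_adv, loss_fm)
total_g_loss = loss_adv + lambda_fm * loss_fm  # fm term now comparable to loss_adv
```

With a fixed lambda_fm, a large feature-matching loss early in training can dominate the objective; rescaling per step keeps both terms comparable.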
Last updated 3 years ago; the repository is inactive.