keonlee9420: PyTorch implementation for text-to-speech using denoising diffusion GANs
Top 80.9% on SourcePulse
This repository provides a PyTorch implementation of DiffGAN-TTS, a text-to-speech system that combines Denoising Diffusion Probabilistic Models with Generative Adversarial Networks for high-fidelity and efficient audio synthesis. It is targeted at researchers and developers working on advanced TTS systems, offering controllable speech generation and supporting both single-speaker and multi-speaker scenarios.
How It Works
DiffGAN-TTS employs a two-stage approach. The first stage trains a FastSpeech2-based acoustic model that includes an auxiliary (mel) decoder. The second stage applies a shallow diffusion mechanism that refines the auxiliary decoder's coarse output rather than denoising from pure noise, generating high-quality speech in only a few denoising steps. This hybrid approach aims to combine the sampling efficiency of GANs with the high-fidelity generation capabilities of diffusion models.
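As a rough illustration, the following minimal sketch (hypothetical function and argument names, not the repository's actual API) shows how a shallow-diffusion inference pass could combine the stage-1 auxiliary decoder's coarse prediction with a few GAN-trained denoising steps:

import torch

def shallow_diffusion_infer(text_encoder, aux_decoder, denoiser, phonemes, n_steps=4):
    # Stage 1: FastSpeech2-style encoder plus auxiliary decoder produce a coarse mel prior.
    hidden = text_encoder(phonemes)                  # (batch, frames, hidden)
    coarse_mel = aux_decoder(hidden)                 # coarse mel-spectrogram
    # Stage 2: start the reverse process from a noised coarse mel instead of pure noise,
    # then let the GAN-trained denoiser refine it over a small number of steps.
    x = coarse_mel + 0.1 * torch.randn_like(coarse_mel)   # illustrative noise level
    for t in reversed(range(n_steps)):
        t_batch = torch.full((x.size(0),), t, device=x.device)
        x = denoiser(x, t_batch, hidden)
    return x  # refined mel-spectrogram, converted to audio by a vocoder (e.g., HiFi-GAN)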
Quick Start & Requirements
pip3 install -r requirements.txt
python3 preprocess.py --dataset DATASET
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --dataset DATASET
python3 train.py --model MODEL --dataset DATASET
Highlighted Details
Maintenance & Community
The repository was last updated 3 years ago and is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats
The implementation uses VCTK instead of the Mandarin Chinese dataset used in the original paper and adjusts the sample rate to 22050 Hz. The controllability features are inherited from FastSpeech2 rather than being a core contribution of DiffGAN-TTS itself. The README also notes potential issues where lambda_fm can cause the model to explode during training if it is not fixed.
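For context on the lambda_fm caveat: lambda_fm appears to be the weight on the feature-matching term of the adversarial objective. The sketch below (generic names, not the repository's exact loss code) illustrates the role such a weight typically plays:

import torch
import torch.nn.functional as F

def generator_loss(fake_logits, fake_feats, real_feats, lambda_fm=1.0):
    # Adversarial term: push discriminator outputs on generated mels toward "real".
    adv = F.mse_loss(fake_logits, torch.ones_like(fake_logits))
    # Feature-matching term: match intermediate discriminator activations on
    # generated vs. real samples; lambda_fm controls its contribution, and a
    # poorly chosen value can destabilize training.
    fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_feats, real_feats))
    return adv + lambda_fm * fm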