DiffGAN-TTS by keonlee9420

PyTorch implementation for text-to-speech using denoising diffusion GANs

Created 3 years ago · 336 stars · Top 83.0% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of DiffGAN-TTS, a text-to-speech system that combines Denoising Diffusion Probabilistic Models with Generative Adversarial Networks for high-fidelity and efficient audio synthesis. It is targeted at researchers and developers working on advanced TTS systems, offering controllable speech generation and supporting both single-speaker and multi-speaker scenarios.

How It Works

DiffGAN-TTS employs a two-stage approach. The first stage trains a FastSpeech2-based model that includes an auxiliary (mel) decoder. The second stage applies a shallow diffusion mechanism on top of the auxiliary decoder's output to generate high-quality speech. The hybrid approach aims to combine the sampling efficiency of GANs with the high-fidelity generation of diffusion models.
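Below is a minimal PyTorch-style sketch of the shallow diffusion idea described above: the auxiliary decoder's coarse mel prediction is diffused to a shallow step, then denoised in a handful of steps by the adversarially trained denoiser. All names (the `aux_decoder`/`denoiser` callables, `q_sample`, `shallow_diffusion_inference`) and the schedule values are illustrative assumptions, not the repository's actual modules or hyperparameters.

```python
# Conceptual sketch only -- names and values are hypothetical and do NOT
# mirror the repository's API.
import torch

T_SHALLOW = 4  # a small number of diffusion steps, hence "shallow"

# Pre-computed noise schedule for the shallow chain (illustrative values).
betas = torch.linspace(1e-4, 0.06, T_SHALLOW)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(mel0, t, noise):
    """Forward diffusion: jump directly from a clean mel to step t."""
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * mel0 + (1.0 - a_bar).sqrt() * noise

@torch.no_grad()
def shallow_diffusion_inference(aux_decoder, denoiser, text_hidden):
    # Stage 1: a FastSpeech2-style auxiliary decoder predicts a coarse mel.
    coarse_mel = aux_decoder(text_hidden)

    # Start the reverse chain from a *diffused* version of the coarse mel
    # instead of pure noise -- this is the "shallow" part.
    x = q_sample(coarse_mel, T_SHALLOW - 1, torch.randn_like(coarse_mel))

    # Stage 2: the adversarially trained denoiser removes the remaining noise
    # in a few large steps, conditioned on the text hidden states.
    for t in reversed(range(T_SHALLOW)):
        x = denoiser(x, torch.tensor([t]), text_hidden)
    return x
```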

Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt
  • Requires pre-trained models for inference.
  • Supports LJSpeech (single-speaker) and VCTK (multi-speaker) datasets.
  • Forced alignment requires Montreal Forced Aligner (MFA) or pre-extracted alignments.
  • Preprocessing: python3 preprocess.py --dataset DATASET
  • Inference: python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --dataset DATASET
  • Training: python3 train.py --model MODEL --dataset DATASET (a minimal end-to-end sketch follows this list)
  • Official documentation and audio samples are available within the repository.
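As referenced above, here is a minimal end-to-end sketch that chains the documented commands via `subprocess`. Only the command-line flags shown in the list above are taken from the repository; the `MODEL`, `RESTORE_STEP`, and text values are placeholders, and the valid model names and checkpoint steps must be looked up in the repository's configs and releases.

```python
# Convenience wrapper around the CLI commands listed above; a sketch only.
import subprocess

DATASET = "LJSpeech"           # or "VCTK" for the multi-speaker setup
MODEL = "MODEL"                # placeholder; see the repository's configs
RESTORE_STEP = "RESTORE_STEP"  # placeholder checkpoint step

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Preprocess (assumes MFA alignments or pre-extracted alignments are ready).
run(["python3", "preprocess.py", "--dataset", DATASET])

# 2. Train.
run(["python3", "train.py", "--model", MODEL, "--dataset", DATASET])

# 3. Synthesize from text with a restored checkpoint.
run(["python3", "synthesize.py",
     "--text", "YOUR_DESIRED_TEXT",
     "--model", MODEL,
     "--restore_step", RESTORE_STEP,
     "--dataset", DATASET])
```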

Highlighted Details

  • Implements a "shallow diffusion mechanism" for efficient and high-fidelity TTS.
  • Offers controllability over pitch, volume, and speaking rate, inherited from FastSpeech2 (see the conceptual sketch after this list).
  • Supports multi-speaker TTS using external speaker embeddings (e.g., DeepSpeaker).
  • Demonstrates clear speaker identification with DeepSpeaker embeddings on VCTK.
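The sketch below illustrates how FastSpeech2-style control factors typically scale the variance-adaptor predictions (pitch, energy/volume, duration/speaking rate) before they condition the decoder. The function and argument names are illustrative assumptions and do not correspond to the repository's code.

```python
# Conceptual sketch of FastSpeech2-style control, which DiffGAN-TTS inherits.
import torch

def apply_controls(pitch_pred, energy_pred, duration_pred,
                   pitch_control=1.0, energy_control=1.0, duration_control=1.0):
    """Scale variance-adaptor predictions before they condition the decoder.

    pitch_pred / energy_pred : per-phoneme (or per-frame) predictions
    duration_pred            : predicted phoneme durations in frames
    *_control                : 1.0 keeps the prediction unchanged; >1.0 raises
                               pitch/volume or slows speech, <1.0 does the opposite.
    """
    pitch = pitch_pred * pitch_control
    energy = energy_pred * energy_control
    duration = torch.clamp(torch.round(duration_pred * duration_control), min=0)
    return pitch, energy, duration

# Example: slightly higher pitch, unchanged volume, 20% slower speech.
p, e, d = apply_controls(torch.randn(10), torch.randn(10),
                         torch.rand(10) * 5, 1.2, 1.0, 1.2)
```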

Maintenance & Community

  • The repository is marked as "Active", although the health metrics below show no commits in the last three years.
  • Key references include DiffSinger and FastSpeech2 implementations.
  • No specific community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. It references other repositories which may have different licenses.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The implementation uses VCTK instead of the original paper's Mandarin Chinese dataset and adjusts the sampling rate to 22050 Hz. The controllability features originate from FastSpeech2 rather than being a core innovation of DiffGAN-TTS itself. The README also warns that the feature-matching loss weight (lambda_fm) can cause the model to explode during training if it is not fixed.
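For context, here is a generic sketch of how a feature-matching term is usually weighted into a GAN generator loss; it illustrates why an ill-chosen lambda_fm can destabilize training, and does not reproduce the repository's actual loss code.

```python
# Illustrative sketch of a weighted feature-matching term in a generator loss.
import torch
import torch.nn.functional as F

def feature_matching_loss(real_feats, fake_feats):
    """L1 distance between discriminator feature maps of real and generated mels."""
    return sum(F.l1_loss(f_fake, f_real.detach())
               for f_real, f_fake in zip(real_feats, fake_feats))

def generator_loss(adv_loss, real_feats, fake_feats, lambda_fm=2.0):
    # If lambda_fm is too large (or scaled without care), the feature-matching
    # term can dominate the adversarial term and destabilize training.
    return adv_loss + lambda_fm * feature_matching_loss(real_feats, fake_feats)
```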

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
