DiffGAN-TTS by keonlee9420

PyTorch implementation for text-to-speech using denoising diffusion GANs

Created 3 years ago · 336 stars · Top 83.0% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of DiffGAN-TTS, a text-to-speech system that combines Denoising Diffusion Probabilistic Models with Generative Adversarial Networks for high-fidelity and efficient audio synthesis. It is targeted at researchers and developers working on advanced TTS systems, offering controllable speech generation and supporting both single-speaker and multi-speaker scenarios.

How It Works

DiffGAN-TTS employs a two-stage approach. The first stage trains a FastSpeech2-based model that includes an auxiliary (mel) decoder. The second stage applies a shallow diffusion mechanism on top of the auxiliary decoder's output to generate high-quality speech. The hybrid approach aims to combine the sampling efficiency of GANs with the high-fidelity generation of diffusion models.
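Below is a minimal PyTorch-style sketch of the shallow diffusion idea described above: the auxiliary decoder's coarse mel prediction is diffused to a shallow step, then denoised in a handful of steps by the adversarially trained denoiser. All names (the `aux_decoder`/`denoiser` callables, `q_sample`, `shallow_diffusion_inference`) and the schedule values are illustrative assumptions, not the repository's actual modules or hyperparameters.

```python
# Conceptual sketch only -- names and values are hypothetical and do NOT
# mirror the repository's API.
import torch

T_SHALLOW = 4  # a small number of diffusion steps, hence "shallow"

# Pre-computed noise schedule for the shallow chain (illustrative values).
betas = torch.linspace(1e-4, 0.06, T_SHALLOW)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(mel0, t, noise):
    """Forward diffusion: jump directly from a clean mel to step t."""
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * mel0 + (1.0 - a_bar).sqrt() * noise

@torch.no_grad()
def shallow_diffusion_inference(aux_decoder, denoiser, text_hidden):
    # Stage 1: a FastSpeech2-style auxiliary decoder predicts a coarse mel.
    coarse_mel = aux_decoder(text_hidden)

    # Start the reverse chain from a *diffused* version of the coarse mel
    # instead of pure noise -- this is the "shallow" part.
    x = q_sample(coarse_mel, T_SHALLOW - 1, torch.randn_like(coarse_mel))

    # Stage 2: the adversarially trained denoiser removes the remaining noise
    # in a few large steps, conditioned on the text hidden states.
    for t in reversed(range(T_SHALLOW)):
        x = denoiser(x, torch.tensor([t]), text_hidden)
    return x
```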

Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt
  • Requires pre-trained models for inference.
  • Supports LJSpeech (single-speaker) and VCTK (multi-speaker) datasets.
  • Forced alignment requires Montreal Forced Aligner (MFA) or pre-extracted alignments.
  • Preprocessing: python3 preprocess.py --dataset DATASET
  • Inference: python3 synthesize.py --text "YOUR_DESIRED_TEXT" --model MODEL --restore_step RESTORE_STEP --dataset DATASET
  • Training: python3 train.py --model MODEL --dataset DATASET (a minimal end-to-end sketch follows this list)
  • Official documentation and audio samples are available within the repository.
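As referenced above, here is a minimal end-to-end sketch that chains the documented commands via `subprocess`. Only the command-line flags shown in the list above are taken from the repository; the `MODEL`, `RESTORE_STEP`, and text values are placeholders, and the valid model names and checkpoint steps must be looked up in the repository's configs and releases.

```python
# Convenience wrapper around the CLI commands listed above; a sketch only.
import subprocess

DATASET = "LJSpeech"           # or "VCTK" for the multi-speaker setup
MODEL = "MODEL"                # placeholder; see the repository's configs
RESTORE_STEP = "RESTORE_STEP"  # placeholder checkpoint step

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Preprocess (assumes MFA alignments or pre-extracted alignments are ready).
run(["python3", "preprocess.py", "--dataset", DATASET])

# 2. Train.
run(["python3", "train.py", "--model", MODEL, "--dataset", DATASET])

# 3. Synthesize from text with a restored checkpoint.
run(["python3", "synthesize.py",
     "--text", "YOUR_DESIRED_TEXT",
     "--model", MODEL,
     "--restore_step", RESTORE_STEP,
     "--dataset", DATASET])
```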

Highlighted Details

  • Implements a "shallow diffusion mechanism" for efficient and high-fidelity TTS.
  • Offers controllability over pitch, volume, and speaking rate, inherited from FastSpeech2 (see the conceptual sketch after this list).
  • Supports multi-speaker TTS using external speaker embeddings (e.g., DeepSpeaker).
  • Demonstrates clear speaker identification with DeepSpeaker embeddings on VCTK.
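The sketch below illustrates how FastSpeech2-style control factors typically scale the variance-adaptor predictions (pitch, energy/volume, duration/speaking rate) before they condition the decoder. The function and argument names are illustrative assumptions and do not correspond to the repository's code.

```python
# Conceptual sketch of FastSpeech2-style control, which DiffGAN-TTS inherits.
import torch

def apply_controls(pitch_pred, energy_pred, duration_pred,
                   pitch_control=1.0, energy_control=1.0, duration_control=1.0):
    """Scale variance-adaptor predictions before they condition the decoder.

    pitch_pred / energy_pred : per-phoneme (or per-frame) predictions
    duration_pred            : predicted phoneme durations in frames
    *_control                : 1.0 keeps the prediction unchanged; >1.0 raises
                               pitch/volume or slows speech, <1.0 does the opposite.
    """
    pitch = pitch_pred * pitch_control
    energy = energy_pred * energy_control
    duration = torch.clamp(torch.round(duration_pred * duration_control), min=0)
    return pitch, energy, duration

# Example: slightly higher pitch, unchanged volume, 20% slower speech.
p, e, d = apply_controls(torch.randn(10), torch.randn(10),
                         torch.rand(10) * 5, 1.2, 1.0, 1.2)
```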

Maintenance & Community

  • The repository is marked as "Active", although the health metrics below show no commits in the last three years.
  • Key references include DiffSinger and FastSpeech2 implementations.
  • No specific community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. It references other repositories which may have different licenses.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The implementation uses VCTK instead of the original paper's Mandarin Chinese dataset and adjusts the sampling rate to 22050 Hz. The controllability features originate from FastSpeech2 rather than being a core innovation of DiffGAN-TTS itself. The README also warns that the feature-matching loss weight (lambda_fm) can cause the model to explode during training if it is not fixed.
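For context, here is a generic sketch of how a feature-matching term is usually weighted into a GAN generator loss; it illustrates why an ill-chosen lambda_fm can destabilize training, and does not reproduce the repository's actual loss code.

```python
# Illustrative sketch of a weighted feature-matching term in a generator loss.
import torch
import torch.nn.functional as F

def feature_matching_loss(real_feats, fake_feats):
    """L1 distance between discriminator feature maps of real and generated mels."""
    return sum(F.l1_loss(f_fake, f_real.detach())
               for f_real, f_fake in zip(real_feats, fake_feats))

def generator_loss(adv_loss, real_feats, fake_feats, lambda_fm=2.0):
    # If lambda_fm is too large (or scaled without care), the feature-matching
    # term can dominate the adversarial term and destabilize training.
    return adv_loss + lambda_fm * feature_matching_loss(real_feats, fake_feats)
```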

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
