FastDiff by Rongjiehuang

PyTorch implementation for fast, high-fidelity speech synthesis via conditional diffusion

Created 3 years ago · 411 stars · Top 72.2% on sourcepulse

Project Summary

FastDiff provides a PyTorch implementation of a fast, high-fidelity conditional diffusion model for speech synthesis. It is designed for researchers and developers working on advanced text-to-speech (TTS) systems, offering efficient generation and integration with existing TTS pipelines.

How It Works

FastDiff leverages conditional diffusion probabilistic models to achieve high-quality speech synthesis with improved efficiency. The core approach is a diffusion process that iteratively refines a noisy signal into a coherent waveform, conditioned on acoustic features such as mel-spectrograms (the end-to-end FastDiff-TTS variant conditions on text directly). The method aims to balance generation speed against audio fidelity, a common trade-off in diffusion-based models.
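To make the iterative refinement concrete, here is a minimal PyTorch sketch of a standard DDPM-style reverse loop for a mel-conditioned vocoder. This is an illustration, not FastDiff's actual sampler (the paper's contribution is a noise-schedule predictor that cuts the step count to a handful of iterations); the model interface, hop size, and linear beta schedule are all assumptions.

```python
import torch

# Hypothetical mel-conditioned reverse diffusion using standard DDPM
# ancestral sampling. NOT FastDiff's actual sampler: FastDiff predicts a
# much shorter noise schedule; the linear betas and hop size here are
# illustrative assumptions.
@torch.no_grad()
def reverse_diffusion(model, mel, num_steps=1000, hop_size=256, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, num_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure Gaussian noise shaped like the target waveform.
    x = torch.randn(1, mel.shape[-1] * hop_size, device=device)

    for t in reversed(range(num_steps)):
        # The network predicts the noise component, conditioned on the mel.
        eps = model(x, mel, torch.tensor([t], device=device))
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise  # sigma_t = sqrt(beta_t)
    return x
```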

Quick Start & Requirements

  • Install: Clone the repository.
  • Prerequisites: NVIDIA GPU with CUDA and cuDNN, PyTorch, librosa, NATSpeech (a quick environment check is sketched after this list).
  • Pretrained Models: Checkpoints for LJSpeech, LibriTTS, VCTK, and Tacotron are available.
  • Demo: An example notebook egs/demo_tacotron.ipynb is provided.
  • Docs: Configuration files for supported datasets are in modules/FastDiff/config/.
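Since the stack depends on a working CUDA/cuDNN build of PyTorch, a short sanity check before running anything saves debugging time later. The snippet below uses only stock PyTorch calls and is not taken from the FastDiff README:

```python
import torch

# Verify the GPU stack before running any of the repo's scripts.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("cuDNN version:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```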

Highlighted Details

  • Implements FastDiff (IJCAI'22) for efficient, high-quality speech synthesis.
  • Supports multiple datasets (LJSpeech, LibriTTS, VCTK) with provided checkpoints.
  • Offers integration with other TTS models such as Tacotron, PortaSpeech, and DiffSpeech.
  • Includes options for fine-tuning and inference from text, mel-spectrograms, or WAV files.

Maintenance & Community

The underlying paper was accepted at IJCAI 2022. Follow-up work, ProDiff, is also available on GitHub. The repository is not an officially supported Tencent product.

Licensing & Compatibility

The repository reuses code from NATSpeech, Tacotron2, and DiffWave-Vocoder. No license is explicitly stated in the README, but the disclaimer prohibits using the technology to generate someone's speech without consent, which implies potential legal restrictions on commercial use or on distributing generated audio.

Limitations & Caveats

The README notes that mismatched mel-spectrogram preprocessing can produce noisy output, and recommends fine-tuning for better quality. The disclaimer highlights legal and ethical considerations around generating a person's speech without consent.
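To illustrate the mismatch risk: the vocoder only sounds clean if inference-time mel features are computed with exactly the STFT and filterbank settings used in training. Below is a hypothetical librosa-based extraction; every parameter value shown is a common default, not FastDiff's, and should be replaced with the values from the matching file under modules/FastDiff/config/.

```python
import librosa
import numpy as np

# Hypothetical mel extraction; every default below is a common setting,
# not read from FastDiff. Replace them with the values in the matching
# config file, or the vocoder output will be noisy.
def extract_mel(wav_path, sr=22050, n_fft=1024, hop_length=256,
                win_length=1024, n_mels=80, fmin=0, fmax=8000):
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        win_length=win_length, n_mels=n_mels, fmin=fmin, fmax=fmax)
    # Log compression; the repo's own preprocessing may normalize or clip
    # differently, which is exactly the mismatch the README warns about.
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))
```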

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days
Explore Similar Projects

metavoice-src by metavoiceio

  • TTS model for human-like, expressive speech
  • 4k stars; created 1 year ago, updated 1 year ago
  • Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Pietro Schirano (founder of MagicPath), and 1 more.