ProDiff offers an extremely fast, high-fidelity text-to-speech (TTS) pipeline for industrial deployment, leveraging conditional diffusion probabilistic models. It targets researchers and developers seeking efficient and high-quality speech synthesis solutions.
How It Works
The pipeline uses a two-stage design: ProDiff as the acoustic model (text to mel-spectrogram) and FastDiff as the neural vocoder (mel-spectrogram to waveform). Both are diffusion models, and synthesis speed is controlled by the number of reverse sampling steps in each stage, so the pipeline can trade a small amount of quality for large gains in speed. This makes it suitable for real-time or near-real-time applications; a conceptual sketch of the two stages follows.
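To make the two-stage flow concrete, here is a minimal conceptual sketch. The class names (ProDiffAcoustic, FastDiffVocoder), their constructors, and the sample() methods are hypothetical stand-ins rather than the repository's actual API, and the tensor shapes (80 mel bins, hop size 256) are assumptions.

```python
# Conceptual sketch only: ProDiffAcoustic / FastDiffVocoder and their sample()
# methods are hypothetical stand-ins, not the repository's real classes.
import torch


class ProDiffAcoustic:
    """Stage 1 (hypothetical): phonemes -> mel-spectrogram via a few reverse diffusion steps."""

    def __init__(self, checkpoint: str, n_steps: int = 2):
        self.checkpoint = checkpoint
        self.n_steps = n_steps  # fewer reverse steps => faster synthesis, slightly lower fidelity

    def sample(self, phonemes: torch.Tensor) -> torch.Tensor:
        # Placeholder: the real model iteratively denoises a latent mel for n_steps iterations.
        return torch.zeros(1, 80, phonemes.shape[-1] * 4)  # 80 mel bins assumed


class FastDiffVocoder:
    """Stage 2 (hypothetical): mel-spectrogram -> waveform, also with a configurable step count."""

    def __init__(self, checkpoint: str, n_steps: int = 4):
        self.checkpoint = checkpoint
        self.n_steps = n_steps

    def sample(self, mel: torch.Tensor) -> torch.Tensor:
        # Placeholder: the real vocoder runs n_steps reverse iterations to produce audio.
        return torch.zeros(1, mel.shape[-1] * 256)  # hop size of 256 assumed


# The key knob is n_steps in both stages; the papers report usable quality
# with very few iterations (2 for ProDiff, 4 for FastDiff).
acoustic = ProDiffAcoustic("checkpoints/ProDiff", n_steps=2)
vocoder = FastDiffVocoder("checkpoints/FastDiff", n_steps=4)

phoneme_ids = torch.randint(0, 100, (1, 50))  # dummy phoneme IDs
mel = acoustic.sample(phoneme_ids)            # stage 1: text/phonemes -> mel
wav = vocoder.sample(mel)                     # stage 2: mel -> waveform
```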
Quick Start & Requirements
- Install/Run: Clone the repository, download the checkpoints with snapshot_download from the Hugging Face Hub, and move them into the checkpoints/ directory (a download sketch follows this list).
- Prerequisites: NVIDIA GPU with CUDA and cuDNN, PyTorch, librosa.
- Setup: Pre-trained models must be downloaded before inference; after that, inference follows the commands provided in the repository.
- Links: Demo Page, FastDiff
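A minimal sketch of the checkpoint download step, assuming the huggingface_hub client; the repo_id below is a placeholder to be replaced with the checkpoint repository IDs listed in the project's README.

```python
# Sketch of the checkpoint download step; the repo_id is a placeholder,
# substitute the checkpoint repository IDs given in the project's README.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/<ProDiff-checkpoint>",  # hypothetical ID, not a real repository
    local_dir="checkpoints/ProDiff",       # the project expects models under checkpoints/
)
```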
Highlighted Details
- Extremely-fast diffusion TTS pipeline.
- PyTorch implementation of ProDiff (ACM Multimedia 2022) and FastDiff (IJCAI 2022).
- Supports speed-quality trade-offs via adjustable sampling iterations (see the timing sketch after this list).
- Provides pre-trained models for LJSpeech.
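To probe the speed-quality trade-off, one can time synthesis at a few step settings. In the sketch below, synthesize is a hypothetical stand-in for whatever inference entry point you wire up from the repository's scripts, and the 22.05 kHz sample rate is an assumption.

```python
# Hedged timing sketch: synthesize() is a hypothetical placeholder for
# ProDiff + FastDiff inference run with n_steps reverse iterations.
import time

import torch


def synthesize(text: str, n_steps: int) -> torch.Tensor:
    time.sleep(0.01 * n_steps)   # stand-in cost that grows with the step count
    return torch.zeros(22050)    # 1 second of silence at an assumed 22.05 kHz


for n_steps in (2, 4, 8):
    start = time.perf_counter()
    wav = synthesize("Hello world.", n_steps)
    elapsed = time.perf_counter() - start
    rtf = elapsed / (wav.numel() / 22050)  # real-time factor: compute time / audio duration
    print(f"{n_steps} steps: {elapsed:.3f}s, RTF ~ {rtf:.3f}")
```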
Maintenance & Community
- The project is associated with authors from multiple institutions, indicating potential academic backing.
- Links to related projects like FastDiff and NATSpeech are provided.
Licensing & Compatibility
- The repository does not explicitly state a license. However, its disclaimer prohibits using the technology to generate a person's speech without their consent, which may imply usage restrictions beyond typical open-source licenses.
Limitations & Caveats
- The primary pre-trained model is for LJSpeech; support for more datasets is pending.
- The disclaimer regarding consent for speech generation suggests potential legal or ethical considerations for commercial use.