PlayDiffusion by playht

Diffusion model for speech editing

created 2 months ago
510 stars

Top 62.0% on sourcepulse

Project Summary

PlayDiffusion is a diffusion-based model for editing speech audio. It supports fine-grained changes such as word replacement without introducing discontinuities or altering prosody. It targets researchers and developers building speech editing tools, and aims for more coherent, natural-sounding output than autoregressive models produce.

How It Works

The core of the approach is a non-autoregressive diffusion model that operates on discrete audio tokens. Audio is first encoded into a compact token sequence. When an edit is needed, the tokens covering the edited span are masked, and the diffusion model, conditioned on the updated text and a speaker embedding, denoises the masked region. The surrounding tokens are left untouched, which keeps the transitions into and out of the edit seamless. Finally, a BigVGAN decoder converts the token sequence back into a waveform. Because generation is non-autoregressive, the model can use context on both sides of the edit, including audio that comes after it, which improves edit quality.
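The mask-and-denoise loop described above can be illustrated with a toy sketch. This is not the PlayDiffusion implementation: the `MASK` sentinel, the `toy_predict` stand-in for the model, the confidence-ranked commit order, and the step count are all assumptions made for illustration. The point is only that tokens outside the edited span are never touched and that each prediction can look at context on both sides.

```python
import random

MASK = -1  # hypothetical sentinel marking a masked token position


def toy_predict(tokens, position, rng):
    """Stand-in for the diffusion model: returns (token, confidence).

    A real model would condition on the edited text and a speaker embedding.
    This toy copies the nearest unmasked neighbour (left or right) and
    attaches a random confidence score.
    """
    for offset in range(1, len(tokens)):
        for idx in (position - offset, position + offset):
            if 0 <= idx < len(tokens) and tokens[idx] != MASK:
                return tokens[idx], rng.random()
    return 0, rng.random()  # no unmasked context at all


def inpaint(tokens, start, end, steps=4, seed=0):
    """Mask tokens[start:end], then fill the span over several passes."""
    rng = random.Random(seed)
    out = list(tokens)
    for i in range(start, end):
        out[i] = MASK
    for step in range(steps):
        masked = [i for i, t in enumerate(out) if t == MASK]
        if not masked:
            break
        # Predict every masked position using context on both sides.
        guesses = {i: toy_predict(out, i, rng) for i in masked}
        # Commit the most confident guesses first; the final pass commits all.
        remaining_steps = steps - step
        keep = -(-len(masked) // remaining_steps)  # ceiling division
        ranked = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:keep]:
            out[i] = guesses[i][0]
    return out


original = [11, 12, 13, 14, 15, 16, 17, 18]
edited = inpaint(original, start=3, end=6)
print(edited)
```

Only positions 3 to 5 can change here; the tokens before and after the span come through unmodified, which is the property that keeps the edit boundaries seamless.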

Quick Start & Requirements

  • Install: pip install '.[demo]'
  • Requirements: Python 3.11; the OPENAI_API_KEY environment variable (used for ASR and word timings, unless an alternative timing source is supplied); Gradio; and model checkpoints from Hugging Face. A GPU is not listed explicitly but is effectively required to run the diffusion model.
  • Docker: Build and run commands are provided for Docker/Podman, including GPU device mapping and volume mounts for the Hugging Face and Whisper caches.
  • Demo: Run with python demo/gradio-demo.py.
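Before launching the demo, the two requirements that can be verified locally (Python version and API key) can be checked with a small script. This is a hypothetical helper, not part of the repository; it assumes the requirement means exactly Python 3.11 and that the key is read from the OPENAI_API_KEY environment variable as listed above.

```python
import os
import sys


def check_environment(env=None, version=None):
    """Return a list of human-readable problems; an empty list means ready.

    `env` and `version` default to the live process values and can be
    overridden for testing.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)[:2]
    problems = []
    # Assumption: the listed requirement means exactly Python 3.11.
    if version != (3, 11):
        problems.append(f"Python 3.11 expected, found {version[0]}.{version[1]}")
    # The key is needed for ASR/timing unless another timing source is used.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```

The script does not check for a GPU or for downloaded checkpoints, since how those are located is not described here.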

Highlighted Details

  • Addresses limitations of autoregressive models for speech inpainting and editing.
  • Employs non-causal attention heads in a modified Llama architecture.
  • Utilizes a custom BPE tokenizer with 10,000 tokens for efficiency.
  • Incorporates speaker conditioning for consistent voice identity.
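The difference between causal and non-causal attention mentioned above can be shown with plain boolean masks. This is a generic illustration of the two masking schemes, not code from the PlayDiffusion model; the function names are made up for the example.

```python
def attention_mask(seq_len, causal):
    """Build a seq_len x seq_len grid of booleans.

    Entry [i][j] is True when query position i may attend to key position j.
    A causal mask only allows j <= i; a non-causal mask allows every pair.
    """
    return [[(j <= i) or not causal for j in range(seq_len)]
            for i in range(seq_len)]


def visible_positions(mask, query):
    """List the key positions that a given query position can see."""
    return [j for j, allowed in enumerate(mask[query]) if allowed]


causal = attention_mask(5, causal=True)
non_causal = attention_mask(5, causal=False)

print(visible_positions(causal, 2))      # positions 0..2 only
print(visible_positions(non_causal, 2))  # all five positions
```

With the causal mask, a token in the middle of an edit cannot see the audio that follows it; with the non-causal mask it can, which is what lets an inpainting model match the right-hand boundary of the edited span.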

Maintenance & Community

No specific information on contributors, sponsorships, or community channels (Discord/Slack) is present in the README.

Licensing & Compatibility

The README does not explicitly state a license.

Limitations & Caveats

The README presents the approach but does not document stability, performance benchmarks, or use beyond the demo. The project also depends on specific Hugging Face models and checkpoints, so future updates to those dependencies could cause compatibility issues.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 517 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu

0.3%
3k
Audio generation research paper using latent diffusion
created 2 years ago
updated 1 month ago
Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

0.2%
6k
Text-to-speech model achieving human-level synthesis
created 2 years ago
updated 11 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Pietro Schirano (Founder of MagicPath), and 1 more.

metavoice-src by metavoiceio

0%
4k
TTS model for human-like, expressive speech
created 1 year ago
updated 1 year ago