Diffusion model for speech editing
PlayDiffusion offers a novel diffusion-based approach for editing speech audio, enabling fine-grained modifications like word replacement without introducing discontinuities or altering prosody. It targets researchers and developers building advanced speech editing tools, providing a more coherent and natural audio output compared to traditional autoregressive models.
How It Works
The core innovation lies in a non-autoregressive diffusion model that operates on discrete audio tokens. Audio is first encoded into a compact token sequence. When an edit is needed, the corresponding tokens are masked, and the diffusion model, conditioned on updated text and speaker embeddings, denoises the masked region. This process preserves surrounding context, ensuring seamless transitions. Finally, a BigVGAN decoder reconstructs the token sequence into a waveform. This non-autoregressive strategy allows the model to leverage future context, improving edit quality.
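The mask-then-denoise loop described above can be sketched in a few lines of Python. Everything here is invented for illustration: the integer token values, the `copy_neighbors_predictor` stand-in for the diffusion network (which in PlayDiffusion would condition on the updated text and speaker embedding), and the fixed refinement schedule are all toy assumptions, not the project's actual API.

```python
MASK = -1  # sentinel for masked (to-be-edited) token positions

def copy_neighbors_predictor(tokens, i):
    # Toy stand-in for the diffusion model: predict a masked token from its
    # nearest unmasked neighbors on BOTH sides -- this is the point of the
    # non-autoregressive design, which can look at future context too.
    left = next((t for t in reversed(tokens[:i]) if t != MASK), 0)
    right = next((t for t in tokens[i + 1:] if t != MASK), left)
    return (left + right) // 2

def edit(tokens, start, end, new_len, predictor, steps=3):
    # Replace the edited span with `new_len` masked slots, then fill every
    # masked position in parallel for a few refinement steps. Surrounding
    # tokens are never touched, which preserves the local context.
    work = tokens[:start] + [MASK] * new_len + tokens[end:]
    for _ in range(steps):
        work = [predictor(work, i) if t == MASK else t
                for i, t in enumerate(work)]
        # A real discrete-diffusion schedule would re-mask low-confidence
        # tokens here and iterate; this toy fills everything in one pass.
    return work
```

Because all masked positions are predicted jointly from both left and right context, the filled-in span blends with its surroundings instead of drifting the way a purely left-to-right autoregressive fill can.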
Quick Start & Requirements
Install with:
pip install '.[demo]'
Set the OPENAI_API_KEY environment variable (used for ASR/word timing; an alternative timing source can be supplied). Hugging Face Gradio and model checkpoints are also required, and a GPU is implicitly required for diffusion models. Launch the demo with:
python demo/gradio-demo.py
Highlighted Details
Maintenance & Community
No specific information on contributors, sponsorships, or community channels (Discord/Slack) is present in the README.
Licensing & Compatibility
The README does not explicitly state a license.
Limitations & Caveats
The project is presented as a novel approach; its stability, performance benchmarks, and real-world applicability beyond the demo are not detailed. Its dependency on specific Hugging Face models and checkpoints may also break with future updates.
Last updated 1 month ago; the repository is currently inactive.