VoiceCraft by jasonppy

Zero-shot speech editing and TTS research paper

created 1 year ago
8,351 stars

Top 6.3% on sourcepulse

View on GitHub
Project Summary

VoiceCraft is a zero-shot speech editing and text-to-speech (TTS) system designed for "in-the-wild" audio data like audiobooks and podcasts. It targets researchers and developers needing to clone or modify voices with minimal reference audio, offering state-of-the-art performance.

How It Works

VoiceCraft employs a token infilling neural codec language model. It leverages a few seconds of reference audio to clone or edit unseen voices. This approach allows for flexible manipulation of speech content and style without extensive training data for each new voice.
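The token-infilling idea can be illustrated with a toy sketch (a conceptual illustration only, not VoiceCraft's actual implementation; the `MASK` token and `rearrange_for_infilling` helper are hypothetical): the codec tokens of the span to edit are replaced with a mask marker, and the model generates the replacement tokens at the end of the sequence, conditioned on the unmasked context on both sides.

```python
# Toy sketch of token-infilling sequence rearrangement. The names below
# (MASK, rearrange_for_infilling) are illustrative assumptions, not part
# of the VoiceCraft codebase.

MASK = "<M1>"  # placeholder mask token for the edited span

def rearrange_for_infilling(tokens, edit_start, edit_end):
    """Replace the edit span with a mask token and append a second
    marker, after which the model would generate the new span
    autoregressively, conditioned on the full surrounding context."""
    context = tokens[:edit_start] + [MASK] + tokens[edit_end:]
    return context + [MASK]  # generation happens after this marker

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
seq = rearrange_for_infilling(tokens, 2, 4)  # rewrite tokens t2..t3
# seq == ["t0", "t1", "<M1>", "t4", "t5", "<M1>"]
```

Because the edited span is moved to the end, a standard causal language model can fill it in while still attending to context that originally came after the edit point.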

Quick Start & Requirements

  • Inference:
    • Google Colab notebooks for easy inference.
    • Docker image for Linux/Windows with NVIDIA Container Toolkit.
    • Standalone scripts (tts_demo.py, speech_editing_demo.py).
  • Environment Setup: Requires Python 3.9.16, PyTorch 2.0.1 (CUDA 11.7 compatible), audiocraft, xformers, torchaudio, tensorboard, phonemizer, datasets, torchmetrics, huggingface_hub, ffmpeg, espeak-ng, and Montreal Forced Aligner (MFA) with English models.
  • Resources: Training requires significant compute and storage for datasets like Gigaspeech. Inference is less demanding but still benefits from GPU acceleration.
  • Links: QuickStart Colab, HuggingFace Spaces Demo, Docker Quickstart.
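Before running the demo scripts, it can help to verify that the Python dependencies listed above are importable. The snippet below is a minimal sketch; the module names are assumptions inferred from the package names in the README, and `check_missing` is a hypothetical helper:

```python
# Hedged sketch: report which of the README's Python dependencies are
# not importable in the current environment. Module names are assumed
# to match the package names listed in the environment setup.
import importlib.util

def check_missing(modules):
    """Return the subset of module names that cannot be found."""
    return [m for m in modules if importlib.util.find_spec(m) is None]

required = ["torch", "torchaudio", "audiocraft", "phonemizer",
            "datasets", "torchmetrics", "huggingface_hub"]

missing = check_missing(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All Python dependencies found.")
```

Note that `ffmpeg`, `espeak-ng`, and MFA are system-level tools and would need separate checks (e.g. probing `PATH`).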

Highlighted Details

  • Achieves state-of-the-art performance on in-the-wild audio data.
  • Zero-shot capability for speech editing and TTS with seconds of reference audio.
  • Offers Gradio UI for interactive demos and command-line interfaces for integration.
  • Supports both speech editing and text-to-speech tasks.

Maintenance & Community

  • Active development with recent updates (March-April 2024) including enhanced models and a Replicate demo.
  • Community contributions acknowledged via HuggingFace Spaces.
  • No explicit community channels (Discord/Slack) mentioned in the README.

Licensing & Compatibility

  • Codebase: CC BY-NC-SA 4.0 (Non-commercial, ShareAlike).
  • Model Weights: Coqui Public Model License 1.0.0.
  • Dependencies: Includes code under MIT and Apache 2.0 licenses. Phonemizer is under GPL-3.0.
  • Restrictions: Non-commercial use is strictly enforced for both code and models.

Limitations & Caveats

The CC BY-NC-SA 4.0 and Coqui Public Model License 1.0.0 restrict commercial use. The disclaimer explicitly prohibits using the technology to generate or edit speech without consent, particularly for public figures, warning of potential copyright violations. Training requires careful data preparation and significant computational resources.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 137 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis
Top 0.2% • 6k stars • created 2 years ago • updated 11 months ago