VoiceCraft by jasonppy

Zero-shot speech editing and TTS research paper

created 1 year ago
8,351 stars

Top 6.3% on sourcepulse

View on GitHub
Project Summary

VoiceCraft is a zero-shot speech editing and text-to-speech (TTS) system designed for "in-the-wild" audio data like audiobooks and podcasts. It targets researchers and developers needing to clone or modify voices with minimal reference audio, offering state-of-the-art performance.

How It Works

VoiceCraft employs a token infilling neural codec language model. It leverages a few seconds of reference audio to clone or edit unseen voices. This approach allows for flexible manipulation of speech content and style without extensive training data for each new voice.
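The token-infilling idea can be illustrated with a toy sketch (a conceptual illustration only, not VoiceCraft's actual implementation; the `MASK` token and `rearrange_for_infilling` helper are hypothetical): the codec tokens of the span to edit are replaced with a mask marker, and the model generates the replacement tokens at the end of the sequence, conditioned on the unmasked context on both sides.

```python
# Toy sketch of token-infilling sequence rearrangement. The names below
# (MASK, rearrange_for_infilling) are illustrative assumptions, not part
# of the VoiceCraft codebase.

MASK = "<M1>"  # placeholder mask token for the edited span

def rearrange_for_infilling(tokens, edit_start, edit_end):
    """Replace the edit span with a mask token and append a second
    marker, after which the model would generate the new span
    autoregressively, conditioned on the full surrounding context."""
    context = tokens[:edit_start] + [MASK] + tokens[edit_end:]
    return context + [MASK]  # generation happens after this marker

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
seq = rearrange_for_infilling(tokens, 2, 4)  # rewrite tokens t2..t3
# seq == ["t0", "t1", "<M1>", "t4", "t5", "<M1>"]
```

Because the edited span is moved to the end, a standard causal language model can fill it in while still attending to context that originally came after the edit point.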

Quick Start & Requirements

  • Inference:
    • Google Colab notebooks for easy inference.
    • Docker image for Linux/Windows with NVIDIA Container Toolkit.
    • Standalone scripts (tts_demo.py, speech_editing_demo.py).
  • Environment Setup: Requires Python 3.9.16, PyTorch 2.0.1 (CUDA 11.7 compatible), audiocraft, xformers, torchaudio, tensorboard, phonemizer, datasets, torchmetrics, huggingface_hub, ffmpeg, espeak-ng, and Montreal Forced Aligner (MFA) with English models.
  • Resources: Training requires significant compute and storage for datasets like Gigaspeech. Inference is less demanding but still benefits from GPU acceleration.
  • Links: QuickStart Colab, HuggingFace Spaces Demo, Docker Quickstart.
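Before running the demo scripts, it can help to verify that the Python dependencies listed above are importable. The snippet below is a minimal sketch; the module names are assumptions inferred from the package names in the README, and `check_missing` is a hypothetical helper:

```python
# Hedged sketch: report which of the README's Python dependencies are
# not importable in the current environment. Module names are assumed
# to match the package names listed in the environment setup.
import importlib.util

def check_missing(modules):
    """Return the subset of module names that cannot be found."""
    return [m for m in modules if importlib.util.find_spec(m) is None]

required = ["torch", "torchaudio", "audiocraft", "phonemizer",
            "datasets", "torchmetrics", "huggingface_hub"]

missing = check_missing(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All Python dependencies found.")
```

Note that `ffmpeg`, `espeak-ng`, and MFA are system-level tools and would need separate checks (e.g. probing `PATH`).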

Highlighted Details

  • Achieves state-of-the-art performance on in-the-wild audio data.
  • Zero-shot capability for speech editing and TTS with seconds of reference audio.
  • Offers Gradio UI for interactive demos and command-line interfaces for integration.
  • Supports both speech editing and text-to-speech tasks.

Maintenance & Community

  • Active development with recent updates (March-April 2024) including enhanced models and a Replicate demo.
  • Community contributions acknowledged via HuggingFace Spaces.
  • No explicit community channels (Discord/Slack) mentioned in the README.

Licensing & Compatibility

  • Codebase: CC BY-NC-SA 4.0 (Non-commercial, ShareAlike).
  • Model Weights: Coqui Public Model License 1.0.0.
  • Dependencies: Includes code under MIT and Apache 2.0 licenses. Phonemizer is under GPL-3.0.
  • Restrictions: Non-commercial use is strictly enforced for both code and models.

Limitations & Caveats

The CC BY-NC-SA 4.0 and Coqui Public Model License 1.0.0 restrict commercial use. The disclaimer explicitly prohibits using the technology to generate or edit speech without consent, particularly for public figures, warning of potential copyright violations. Training requires careful data preparation and significant computational resources.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 137 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

Text-to-speech model achieving human-level synthesis
Top 0.2% • 6k stars • created 2 years ago • updated 11 months ago