VoiceCraft by jasonppy

Zero-shot speech editing and TTS research paper

Created 1 year ago
8,382 stars

Top 6.1% on SourcePulse

Project Summary

VoiceCraft is a zero-shot speech editing and text-to-speech (TTS) system designed for "in-the-wild" audio such as audiobooks and podcasts. It targets researchers and developers who need to clone or modify voices from only a few seconds of reference audio, and its authors report state-of-the-art performance on these tasks.

How It Works

VoiceCraft is a token-infilling neural codec language model. Given a few seconds of reference audio, it can clone or edit a voice it has not seen during training. This allows flexible manipulation of speech content and style without collecting extensive training data for each new voice.
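
One common way to frame token infilling for a left-to-right model is to replace each span to be edited with a mask token and move the span's contents to the end of the sequence, so the model can condition on context from both sides of the edit before generating it. The sketch below illustrates that general idea on a single stream of integer tokens. It is not VoiceCraft's actual implementation: the function name `rearrange_for_infilling`, the mask ID scheme, and the toy token values are all hypothetical.

```python
from typing import List, Tuple


def rearrange_for_infilling(
    tokens: List[int],
    spans: List[Tuple[int, int]],
    mask_base: int,
) -> List[int]:
    """Move each span [start, end) to the end of the sequence.

    Each span is replaced in place by a unique mask token. The span's
    contents are appended after the unmasked context, prefixed by the
    same mask token, so a left-to-right model sees both sides of the
    edit before it has to predict the span.
    """
    context: List[int] = []
    tail: List[int] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end > len(tokens) or start >= end:
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        mask = mask_base + i  # hypothetical mask ID, one per span
        context.extend(tokens[cursor:start])
        context.append(mask)
        tail.append(mask)
        tail.extend(tokens[start:end])
        cursor = end
    context.extend(tokens[cursor:])
    return context + tail


# Toy example: eight "codec tokens" with two spans marked for editing.
codec_tokens = [10, 11, 12, 13, 14, 15, 16, 17]
rearranged = rearrange_for_infilling(codec_tokens, [(2, 4), (6, 7)], mask_base=1000)
print(rearranged)
# [10, 11, 1000, 14, 15, 1001, 17, 1000, 12, 13, 1001, 16]
```

At inference time, a model trained on sequences like this would be given everything up to the first trailing mask token and asked to generate the span that follows it.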

Quick Start & Requirements

  • Inference:
    • Google Colab notebooks for easy inference.
    • Docker image for Linux/Windows with NVIDIA Container Toolkit.
    • Standalone scripts (tts_demo.py, speech_editing_demo.py).
  • Environment Setup: Requires Python 3.9.16, PyTorch 2.0.1 (CUDA 11.7 compatible), audiocraft, xformers, torchaudio, tensorboard, phonemizer, datasets, torchmetrics, huggingface_hub, ffmpeg, espeak-ng, and Montreal Forced Aligner (MFA) with English models.
  • Resources: Training requires significant compute and storage for datasets like Gigaspeech. Inference is less demanding but still benefits from GPU acceleration.
  • Links: QuickStart Colab, HuggingFace Spaces Demo, Docker Quickstart.
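
Before running the standalone scripts, it can help to confirm that the dependencies listed above are visible to the interpreter. The following is a minimal sketch of such a check, not part of the VoiceCraft repository. The import names and the executable names (`ffmpeg`, `espeak-ng`, `mfa`) are assumptions about how each dependency is exposed on your system, and the helper `check_environment` is hypothetical.

```python
import importlib.util
import shutil
import sys
from typing import Dict, List

# Assumed import names for the Python dependencies listed above.
PYTHON_PACKAGES = [
    "torch", "torchaudio", "audiocraft", "xformers", "tensorboard",
    "phonemizer", "datasets", "torchmetrics", "huggingface_hub",
]
# Assumed executable names for the system tools (MFA is assumed to be `mfa`).
SYSTEM_TOOLS = ["ffmpeg", "espeak-ng", "mfa"]


def check_environment(packages: List[str], tools: List[str]) -> Dict[str, List[str]]:
    """Return the packages that cannot be imported and the tools not on PATH."""
    missing_packages = [
        name for name in packages if importlib.util.find_spec(name) is None
    ]
    missing_tools = [name for name in tools if shutil.which(name) is None]
    return {"packages": missing_packages, "tools": missing_tools}


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    if (major, minor) != (3, 9):
        print(f"Note: running Python {major}.{minor}; the listed setup uses 3.9.16.")
    report = check_environment(PYTHON_PACKAGES, SYSTEM_TOOLS)
    print("Missing Python packages:", report["packages"] or "none")
    print("Missing system tools:", report["tools"] or "none")
```

This only checks that each dependency can be found. It does not verify versions, CUDA compatibility, or that the MFA English models are downloaded.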

Highlighted Details

  • Reported state-of-the-art performance on speech editing and zero-shot TTS for in-the-wild audio data.
  • Zero-shot capability for both tasks, using only seconds of reference audio.
  • Offers a Gradio UI for interactive demos and command-line interfaces for integration.

Maintenance & Community

  • Most recent updates listed in the README date from March-April 2024, including enhanced models and a Replicate demo.
  • Community contributions acknowledged via HuggingFace Spaces.
  • No explicit community channels (Discord/Slack) mentioned in the README.

Licensing & Compatibility

  • Codebase: CC BY-NC-SA 4.0 (Non-commercial, ShareAlike).
  • Model Weights: Coqui Public Model License 1.0.0.
  • Dependencies: Includes code under MIT and Apache 2.0 licenses. Phonemizer is under the GNU GPL 3.0.
  • Restrictions: Commercial use is not permitted for either the code or the model weights.

Limitations & Caveats

The CC BY-NC-SA 4.0 and Coqui Public Model License 1.0.0 restrict commercial use. The disclaimer explicitly prohibits using the technology to generate or edit speech without consent, particularly for public figures, warning of potential copyright violations. Training requires careful data preparation and significant computational resources.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 29 stars in the last 30 days
