Zero-shot speech editing and TTS research paper
VoiceCraft is a zero-shot speech editing and text-to-speech (TTS) system designed for "in-the-wild" audio data like audiobooks and podcasts. It targets researchers and developers needing to clone or modify voices with minimal reference audio, offering state-of-the-art performance.
How It Works
VoiceCraft employs a token infilling neural codec language model. It leverages a few seconds of reference audio to clone or edit unseen voices. This approach allows for flexible manipulation of speech content and style without extensive training data for each new voice.
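As a rough illustration of what "token infilling" can mean for a sequence of codec tokens, the sketch below cuts the span to be edited out of the sequence, leaves a mask placeholder where it was, and appends the span at the end so a left-to-right model could generate it with both the left and right context already visible. The function names, the `MASK_ID` value, and this particular move-the-span-to-the-end layout are assumptions made for illustration; they are not taken from the VoiceCraft codebase.

```python
from typing import List

MASK_ID = -1  # hypothetical placeholder id, chosen only for this sketch


def rearrange_for_infilling(
    tokens: List[int], start: int, end: int, mask_id: int = MASK_ID
) -> List[int]:
    """Move tokens[start:end] to the end, leaving a mask where the span was.

    Layout of the result: left context, mask, right context, mask, span.
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be non-empty and inside the sequence")
    context = tokens[:start] + [mask_id] + tokens[end:]
    target = [mask_id] + tokens[start:end]
    return context + target


def restore_original_order(sequence: List[int], mask_id: int = MASK_ID) -> List[int]:
    """Undo rearrange_for_infilling: put the trailing span back at the first mask."""
    first = sequence.index(mask_id)
    second = sequence.index(mask_id, first + 1)
    span = sequence[second + 1:]
    return sequence[:first] + span + sequence[first + 1:second]


if __name__ == "__main__":
    codec_tokens = [1, 2, 3, 4, 5, 6]
    rearranged = rearrange_for_infilling(codec_tokens, 2, 4)
    print(rearranged)                          # [1, 2, -1, 5, 6, -1, 3, 4]
    print(restore_original_order(rearranged))  # [1, 2, 3, 4, 5, 6]
```

In an editing setting, the tokens after the second mask would be generated by the model rather than copied, and the restore step would splice the generated span back into place.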
Quick Start & Requirements
Demo scripts: tts_demo.py, speech_editing_demo.py.
Requirements: audiocraft, xformers, torchaudio, tensorboard, phonemizer, datasets, torchmetrics, huggingface_hub, ffmpeg, espeak-ng, and Montreal Forced Aligner (MFA) with English models.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The CC BY-NC-SA 4.0 license and the Coqui Public Model License 1.0.0 both restrict commercial use. The project's disclaimer explicitly prohibits generating or editing a person's speech without their consent, public figures in particular, and warns of potential copyright violations. Training requires careful data preparation and significant computational resources.