CLI tool for zero-shot voice and singing voice conversion, with real-time support
Seed-VC offers zero-shot voice conversion (VC) and singing voice conversion (SVC) with real-time capabilities. It allows users to clone voices from short audio samples (1-30 seconds) without prior training, and supports fine-tuning with minimal data for improved performance. The project targets users needing voice transformation for applications like online meetings, gaming, and live streaming, as well as musicians and content creators.
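As a minimal sketch of the 1-30 second reference-clip window mentioned above, the following checks the length of a PCM WAV file using only Python's standard library. The helper names and the idea of rejecting out-of-range clips are assumptions for illustration; they are not part of Seed-VC, which may handle clip length differently.

```python
import wave

MIN_SECONDS = 1.0   # lower bound quoted in the text above
MAX_SECONDS = 30.0  # upper bound quoted in the text above


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str) -> bool:
    """Hypothetical helper: True when the clip falls inside the quoted 1-30 s window."""
    return MIN_SECONDS <= clip_duration_seconds(path) <= MAX_SECONDS
```

Running a check like this before conversion gives an early, readable error for clips that are too short or too long, rather than a failure later in the pipeline.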
How It Works
Seed-VC utilizes a U-ViT architecture with skip connections, incorporating OpenAI's Whisper as a speech content encoder and NVIDIA's BigVGAN or HIFT for vocoding. The V2 model introduces ASTRAL-Quantization for speaker-disentangled speech tokenization, enabling better accent and emotion conversion. The approach leverages diffusion models for high-quality audio generation, with configurable parameters for balancing speed, intelligibility, and similarity.
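The skip-connection wiring of a U-shaped block stack can be illustrated with a toy forward pass. This is a sketch under stated assumptions, not Seed-VC's implementation: the blocks are plain Python callables on lists of floats, the function name is hypothetical, and the skip is merged by averaging, whereas U-ViT itself concatenates the two activations and applies a linear projection.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def u_shaped_forward(
    x: Vector,
    in_blocks: Sequence[Block],
    mid_block: Block,
    out_blocks: Sequence[Block],
) -> Vector:
    """Run a block stack with long skip connections (toy stand-in for U-ViT wiring)."""
    if len(in_blocks) != len(out_blocks):
        raise ValueError("in_blocks and out_blocks must have the same length")

    skips: List[Vector] = []
    for block in in_blocks:
        x = block(x)
        skips.append(x)  # remember each shallow activation

    x = mid_block(x)

    for block in out_blocks:
        skip = skips.pop()  # last in, first out: deepest "in" block pairs with first "out" block
        # Assumption: merge by averaging instead of concatenation plus projection.
        x = [(a + b) / 2.0 for a, b in zip(x, skip)]
        x = block(x)
    return x
```

The stack mirrors the symmetric pairing of shallow and deep layers: each early activation is reused exactly once by its counterpart on the way out.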
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt (Linux/Windows) or pip install -r requirements-mac.txt (Mac M Series). For Windows users, pip install triton-windows==3.2.0.post13 is recommended for V2 model speed-ups.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
_tkinter errors can occur, requiring a Python installation with Tkinter support.