seed-vc by Plachtaa

CLI tool for zero-shot voice/singing voice conversion, supporting real-time

Created 1 year ago

3,858 stars

Top 12.3% on SourcePulse

Project Summary

Seed-VC offers zero-shot voice conversion (VC) and singing voice conversion (SVC) with real-time capabilities. It allows users to clone voices from short audio samples (1-30 seconds) without prior training, and supports fine-tuning with minimal data for improved performance. The project targets users needing voice transformation for applications like online meetings, gaming, and live streaming, as well as musicians and content creators.
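
As a rough illustration of the 1-30 second reference-clip range mentioned above, the following sketch checks a WAV file's duration before it is used as a voice sample. The helper names (clip_seconds, is_usable_reference) are hypothetical and not part of Seed-VC; the bounds are simply the figures quoted in the summary.

```python
import wave

MIN_SECONDS = 1.0   # lower bound quoted in the summary
MAX_SECONDS = 30.0  # upper bound quoted in the summary


def clip_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(seconds: float) -> bool:
    """True when the duration falls inside the quoted 1-30 second range."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS
```

A caller could combine the two as is_usable_reference(clip_seconds("reference.wav")), where the file name is a placeholder.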

How It Works

Seed-VC utilizes a U-ViT architecture with skip connections, incorporating OpenAI's Whisper as a speech content encoder and NVIDIA's BigVGAN or HIFT for vocoding. The V2 model introduces ASTRAL-Quantization for speaker-disentangled speech tokenization, enabling better accent and emotion conversion. The approach leverages diffusion models for high-quality audio generation, with configurable parameters for balancing speed, intelligibility, and similarity.
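
The stage ordering described above (content encoding, diffusion-based generation, vocoding) can be sketched as a plain function chain. This is only a conceptual outline under stated assumptions: convert and the toy_* stand-ins are hypothetical names, they perform no real signal processing, and nothing here reflects Seed-VC's actual code or API.

```python
from typing import Callable, List

Signal = List[float]


def convert(
    source: Signal,
    reference: Signal,
    encode_content: Callable[[Signal], Signal],
    generate: Callable[[Signal, Signal, int], Signal],
    vocode: Callable[[Signal], Signal],
    steps: int = 10,
) -> Signal:
    """Chain the three stages named in the text, in order."""
    content = encode_content(source)                # role played by Whisper
    acoustic = generate(content, reference, steps)  # role played by the diffusion model
    return vocode(acoustic)                         # role played by BigVGAN or HIFT


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_encoder(signal: Signal) -> Signal:
    return [round(x, 1) for x in signal]


def toy_generator(content: Signal, reference: Signal, steps: int) -> Signal:
    # A real diffusion model would refine its output over `steps` iterations;
    # this stand-in ignores `steps` and just offsets by the reference mean.
    timbre = sum(reference) / len(reference)
    return [c + timbre for c in content]


def toy_vocoder(features: Signal) -> Signal:
    return [f * 0.5 for f in features]


converted = convert([0.1, 0.2], [1.0, 3.0], toy_encoder, toy_generator, toy_vocoder)
```

Passing the stages in as callables mirrors the text's point that the vocoder is swappable (BigVGAN or HIFT) without changing the rest of the chain.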

Quick Start & Requirements

Installation: pip install -r requirements.txt (Linux/Windows) or pip install -r requirements-mac.txt (Mac M Series). For Windows users, pip install triton-windows==3.2.0.post13 is recommended for V2 model speed-ups.
Prerequisites: Python 3.10+, GPU recommended for real-time performance.
Resources: Checkpoints are auto-downloaded.
Docs: Demo Page, Evaluation
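
The prerequisites above can be checked before installing. The sketch below is an assumption-laden helper, not something shipped with Seed-VC: python_ok and gpu_available are hypothetical names, and the GPU check only looks for a CUDA device through PyTorch if PyTorch happens to be installed (it does not cover Apple Silicon backends).

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # "Python 3.10+" from the prerequisites line


def python_ok(version=sys.version_info) -> bool:
    """True when the interpreter's major.minor version is at least 3.10."""
    return tuple(version[:2]) >= MIN_PYTHON


def gpu_available() -> bool:
    """True when PyTorch is installed and reports a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch not installed yet, so nothing to query
    import torch

    return torch.cuda.is_available()
```

Since the GPU is only recommended, a False result from gpu_available() is a warning about real-time performance rather than a hard failure.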

Highlighted Details

Supports zero-shot voice conversion, real-time VC, and singing voice conversion.
Fine-tuning works with as little as 1 utterance per speaker and is fast (about 2 minutes on a T4 GPU).
Real-time VC runs with about 300 ms of algorithm delay plus about 100 ms of device-side delay.
V2 model enhances voice and accent conversion, with better source speaker anonymization.
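
To make the real-time delay figures above concrete, this small sketch adds the two quoted components and converts the total into audio samples. The function names are hypothetical, and the 44,100 Hz sample rate is an assumption chosen for illustration, not a value taken from the project.

```python
ALGORITHM_DELAY_MS = 300  # approximate figure quoted above
DEVICE_DELAY_MS = 100     # approximate figure quoted above


def total_delay_ms(
    algorithm_ms: float = ALGORITHM_DELAY_MS,
    device_ms: float = DEVICE_DELAY_MS,
) -> float:
    """End-to-end delay is the sum of the two quoted components."""
    return algorithm_ms + device_ms


def delay_in_samples(delay_ms: float, sample_rate_hz: int) -> int:
    """Convert a delay in milliseconds to a sample count at a given rate."""
    return round(delay_ms * sample_rate_hz / 1000)


print(total_delay_ms())                            # 400
print(delay_in_samples(total_delay_ms(), 44_100))  # 17640 (assumed sample rate)
```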

Maintenance & Community

Active development with recent updates including V2 model release and Mac M Series support.
No explicit community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Real-time GUI on Mac may encounter _tkinter errors, requiring a Python installation with Tkinter support.
The README does not mention specific hardware requirements beyond recommending a GPU for real-time performance.
No explicit license information is provided, which could impact commercial adoption.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

73 stars in the last 30 days

Explore Similar Projects

Starred by Piotr Dąbkowski (Cofounder of ElevenLabs).

assem-vc by maum-ai

PyTorch code for any-to-many voice conversion research

Created 5 years ago

Updated 4 years ago

Auralis by astramind-ai

TTS engine for fast voice cloning

Created 1 year ago

Updated 1 year ago

MMVC_Trainer by isletennos

Voice conversion trainer for real-time voice changer

Created 4 years ago

Updated 1 year ago

lora-svc by PlayVoice

Singing voice conversion tool using Whisper & BigVGAN

Created 3 years ago

Updated 2 years ago

sesame_csm_openai by phildougherty

OpenAI-compatible TTS API for voice cloning

Created 1 year ago

Updated 9 months ago

Qwen3-Audiobook-Converter by WhiskeyCoder

Audiobook converter using advanced TTS and voice cloning

Created 5 months ago

Updated 3 months ago

Easy-Voice-Toolkit by Spr-Aachen

Local AI voice toolkit for audio processing, recognition, transcription, and conversion

Created 3 years ago

Updated 1 week ago

Starred by Omar Sanseviero (DevRel at Google DeepMind), Jonathan Ragan-Kelley (Professor at MIT), and 3 more.

WhisperSpeech by WhisperSpeech

Open-source text-to-speech system built by inverting Whisper

Created 3 years ago

Updated 6 months ago

VITS-fast-fine-tuning by Plachtaa

VITS pipeline for fast speaker adaptation TTS and voice conversion

Created 3 years ago

Updated 1 year ago

Starred by Jonathan Ragan-Kelley (Professor at MIT), Jason Huggins (Creator of Selenium), and 5 more.

KittenTTS by KittenML

Realistic text-to-speech model under 25MB

Created 11 months ago

Updated 1 month ago

whisper-vits-svc by PlayVoice

Singing voice conversion engine based on VITS

Created 3 years ago

Updated 2 years ago

CosyVoice by FunAudioLLM

Voice generation model for inference, training, and deployment

Created 2 years ago

Updated 1 month ago
