whisper-vits-svc by PlayVoice

Singing voice conversion engine based on VITS

created 2 years ago
2,809 stars

Top 17.3% on sourcepulse

Project Summary

This project provides a core engine for singing voice conversion and cloning, aimed at deep-learning beginners who want practical applications. It lets users create unique singing voices, mix speakers, and even convert voices over light accompaniment, offering a hands-on route into core deep-learning concepts.

How It Works

The system leverages the VITS architecture (variational inference with adversarial learning). It integrates multiple advanced components: OpenAI's Whisper for noise immunity, NVIDIA's BigVGAN for improved audio quality, and Google's speaker encoder for timbre encoding. Novel contributions include PPG and HuBERT perturbations for enhanced noise immunity and timbre removal, alongside a MIX encoder and USP inference for improved conversion stability.
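The dataflow described above can be sketched with stub functions. Everything here is illustrative: the function names, feature dimensions, and frame hop are assumptions for the sketch, not the project's actual API.

```python
import numpy as np

HOP = 320  # assumed samples-per-frame for all feature extractors (illustrative)

# Stubs standing in for the real models; shapes are plausible but assumed.
def whisper_ppg(wav):    return np.zeros((len(wav) // HOP, 1024))  # content features
def hubert_soft(wav):    return np.zeros((len(wav) // HOP, 256))   # de-timbred units
def speaker_embed(wav):  return np.zeros(256)                      # timbre vector
def crepe_f0(wav):       return np.zeros(len(wav) // HOP)          # pitch track

def convert(src_wav, ref_wav):
    """Sketch of the conversion flow: content and pitch come from the source,
    timbre from the reference; a real run feeds these to the VITS decoder."""
    ppg = whisper_ppg(src_wav)
    units = hubert_soft(src_wav)
    spk = speaker_embed(ref_wav)
    f0 = crepe_f0(src_wav)
    # Here we only return the feature shapes instead of decoding audio.
    return {"ppg": ppg.shape, "units": units.shape,
            "spk": spk.shape, "f0": f0.shape}
```

The point of the sketch is the separation of roles: two content streams (PPG and HuBERT units) plus an explicit pitch track, conditioned on a single timbre embedding.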

Quick Start & Requirements

  • Install: pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
  • Prerequisites: PyTorch, Whisper (large-v2.pt), speaker encoder model (best_model.pth.tar), hubert-soft model (hubert-soft-0d54a1f4.pt), crepe full model (full.pth), and a pre-trained VITS model (sovits5.0.pretrain.pth).
  • VRAM: Minimum 6GB for training.
  • Data Prep: Requires audio separation, slicing (under 30s), loudness adjustment, and specific directory structures.
  • Links: Hugging Face Spaces demo: https://huggingface.co/spaces/maxmax20160403/sovits5.0
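The slicing and loudness steps in Data Prep can be sketched in plain NumPy. The sample rate, target peak, and function name below are assumptions for illustration; the repository ships its own preprocessing scripts.

```python
import numpy as np

SR = 32000          # assumed sample rate for training audio (illustrative)
MAX_SEC = 30        # README guidance: keep slices under 30 s
TARGET_PEAK = 0.9   # simple peak-based loudness target (illustrative)

def slice_and_normalize(wav, sr=SR, max_sec=MAX_SEC, target_peak=TARGET_PEAK):
    """Split a mono waveform into <= max_sec chunks and peak-normalize each."""
    chunk_len = max_sec * sr
    chunks = []
    for start in range(0, len(wav), chunk_len):
        chunk = wav[start:start + chunk_len].astype(np.float32)
        peak = np.abs(chunk).max()
        if peak > 0:
            chunk = chunk * (target_peak / peak)  # loudness adjustment
        chunks.append(chunk)
    return chunks
```

Real preprocessing would also separate vocals from accompaniment and write chunks into the speaker-specific directory structure the project expects.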

Highlighted Details

  • Supports multiple speakers and speaker mixing for unique voice creation.
  • Offers branches for improved audio quality (bigvgan-mix-v2) and faster inference (RoFormer-HiFTNet).
  • Allows manual F0 editing via Excel for fine-tuning pitch.
  • Integrates feature retrieval for enhanced timbre stability during inference.
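Speaker mixing can be thought of as a weighted blend of speaker-embedding vectors. This toy function (its name and shapes are illustrative, not the project's API) shows the idea:

```python
import numpy as np

def mix_speakers(embeddings, weights):
    """Blend speaker timbre embeddings with normalized weights (illustrative)."""
    w = np.asarray(weights, dtype=np.float32)
    w = w / w.sum()                       # normalize so weights sum to 1
    E = np.stack(embeddings).astype(np.float32)
    return (w[:, None] * E).sum(axis=0)   # weighted average embedding
```

Feeding the blended vector to the decoder in place of a single speaker's embedding yields a timbre "between" the source speakers.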

Maintenance & Community

The project references several influential research papers and codebases, indicating a strong foundation in the field. Specific community links (Discord/Slack) or active contributor information are not detailed in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project does not support real-time voice conversion. Training can be time-consuming because of the data-perturbation techniques. For optimal results, manual F0 adjustment is required during inference.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 41 stars in the last 90 days

Explore Similar Projects

StyleTTS2 by yl4579
Text-to-speech model achieving human-level synthesis
Top 0.2% · 6k stars · created 2 years ago · updated 11 months ago
Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.