Singing voice conversion engine based on VITS
This project provides a core engine for singing voice conversion and cloning, aimed at deep learning beginners who want a practical application to learn from. It enables users to create unique singing voices, mix speakers, and even convert voices over light accompaniment, offering a hands-on way to master deep learning concepts.
How It Works
The system leverages a Variational Inference with Adversarial Learning approach based on VITS. It integrates multiple advanced components, including OpenAI's Whisper for noise immunity, NVIDIA's BigVGAN for improved audio quality, and Google's speaker encoder for timbre encoding. Novel contributions include PPG and HuBERT perturbations for enhanced noise immunity and de-timbre, alongside a MIX encoder and USP inference for improved conversion stability.
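The component list above implies a three-stream pipeline: content features (Whisper/HuBERT) say *what* is sung, an F0 curve (crepe) says *how* it is pitched, and a speaker embedding says *who* sings it, all fed to a VITS-style decoder. The sketch below is hypothetical: none of these function names come from the project; stand-in extractors only illustrate how the frame-aligned streams would be combined before decoding.

```python
import numpy as np

# Hypothetical sketch of the conversion data flow; stand-in extractors
# return zero arrays with plausible shapes instead of real features.

def extract_content(wav: np.ndarray, hop: int = 320) -> np.ndarray:
    """Stand-in for Whisper/HuBERT content features (frames x dims)."""
    return np.zeros((len(wav) // hop, 256))

def extract_f0(wav: np.ndarray, hop: int = 320) -> np.ndarray:
    """Stand-in for crepe pitch extraction: one F0 value per frame."""
    return np.full(len(wav) // hop, 220.0)

def speaker_embedding(ref_wav: np.ndarray) -> np.ndarray:
    """Stand-in for the speaker encoder's fixed-size timbre vector."""
    return np.zeros(256)

def convert(src_wav: np.ndarray, ref_wav: np.ndarray, hop: int = 320):
    content = extract_content(src_wav, hop)   # what is sung
    f0 = extract_f0(src_wav, hop)             # how it is pitched
    spk = speaker_embedding(ref_wav)          # who sings it
    # A real VITS decoder would synthesize audio here; this only shows
    # that all three streams are frame-aligned before decoding.
    assert content.shape[0] == f0.shape[0]
    frames = np.concatenate(
        [content, f0[:, None], np.tile(spk, (content.shape[0], 1))],
        axis=1,
    )
    return frames

frames = convert(np.zeros(16000), np.zeros(16000))
print(frames.shape)  # (50, 513): 256 content + 1 F0 + 256 speaker dims
```

In the actual system, swapping only the reference audio (and thus the speaker embedding) while keeping content and F0 fixed is what performs the voice conversion.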
Quick Start & Requirements
Install dependencies:
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
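After installing dependencies, the pre-trained checkpoints listed below must be downloaded. A small sanity check like the following can confirm they are all in place before training; the directory layout is an assumption, not taken from the project.

```python
from pathlib import Path

# Checkpoints named in this section; the comments map each file to the
# component it serves. Searching recursively keeps the layout flexible.
REQUIRED = [
    "best_model.pth.tar",        # speaker encoder
    "hubert-soft-0d54a1f4.pt",   # hubert-soft content model
    "full.pth",                  # crepe full pitch model
    "sovits5.0.pretrain.pth",    # pre-trained VITS model
]

def missing_models(root: str) -> list[str]:
    """Return the names from REQUIRED not found anywhere under root."""
    root_path = Path(root)
    return [name for name in REQUIRED if not any(root_path.rglob(name))]
```

For example, `missing_models(".")` run from the repository root returns an empty list once every checkpoint has been downloaded.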
Download the required pre-trained models: the speaker encoder (best_model.pth.tar), the hubert-soft model (hubert-soft-0d54a1f4.pt), the crepe full model (full.pth), and a pre-trained VITS model (sovits5.0.pretrain.pth).
Highlighted Details
Notable variants offer enhanced sound quality (bigvgan-mix-v2) and faster inference (RoFormer-HiFTNet).
Maintenance & Community
The project references several influential research papers and codebases, indicating a strong foundation in the field. Specific community links (Discord/Slack) or active contributor information are not detailed in the README.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project does not support real-time voice conversion. Training can be time-consuming because of the data-perturbation techniques. During inference, manual F0 adjustment may be needed for optimal results.
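The manual F0 adjustment mentioned above usually means transposing the extracted pitch curve by whole semitones before synthesis. A minimal sketch, assuming `f0` is an array with zeros marking unvoiced frames; the function name is illustrative, not the project's API:

```python
import numpy as np

def shift_f0(f0: np.ndarray, semitones: float) -> np.ndarray:
    """Scale voiced F0 values by 2**(semitones/12); leave zeros (unvoiced)."""
    out = f0.astype(float).copy()
    voiced = out > 0
    out[voiced] *= 2.0 ** (semitones / 12.0)
    return out

f0 = np.array([0.0, 220.0, 440.0])
print(shift_f0(f0, 12))  # one octave up: [0., 440., 880.]
```

A +12 shift raises the melody an octave, which is a common fix when the target singer's range sits far above the source recording.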