whisper-vits-svc by PlayVoice

Singing voice conversion engine based on VITS

created 2 years ago
2,809 stars

Top 17.3% on sourcepulse

Project Summary

This project provides a core engine for singing voice conversion and cloning, aimed at deep-learning beginners who want practical applications. It lets users create unique singing voices, mix speakers, and even convert voices over light accompaniment, offering a hands-on route into core deep-learning concepts.

How It Works

The system leverages the VITS architecture (variational inference with adversarial learning). It integrates multiple advanced components: OpenAI's Whisper for noise immunity, NVIDIA's BigVGAN for improved audio quality, and Google's speaker encoder for timbre encoding. Novel contributions include PPG and HuBERT perturbations for enhanced noise immunity and timbre removal, alongside a MIX encoder and USP inference for improved conversion stability.
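The dataflow described above can be sketched with stub functions. Everything here is illustrative: the function names, feature dimensions, and frame hop are assumptions for the sketch, not the project's actual API.

```python
import numpy as np

HOP = 320  # assumed samples-per-frame for all feature extractors (illustrative)

# Stubs standing in for the real models; shapes are plausible but assumed.
def whisper_ppg(wav):    return np.zeros((len(wav) // HOP, 1024))  # content features
def hubert_soft(wav):    return np.zeros((len(wav) // HOP, 256))   # de-timbred units
def speaker_embed(wav):  return np.zeros(256)                      # timbre vector
def crepe_f0(wav):       return np.zeros(len(wav) // HOP)          # pitch track

def convert(src_wav, ref_wav):
    """Sketch of the conversion flow: content and pitch come from the source,
    timbre from the reference; a real run feeds these to the VITS decoder."""
    ppg = whisper_ppg(src_wav)
    units = hubert_soft(src_wav)
    spk = speaker_embed(ref_wav)
    f0 = crepe_f0(src_wav)
    # Here we only return the feature shapes instead of decoding audio.
    return {"ppg": ppg.shape, "units": units.shape,
            "spk": spk.shape, "f0": f0.shape}
```

The point of the sketch is the separation of roles: two content streams (PPG and HuBERT units) plus an explicit pitch track, conditioned on a single timbre embedding.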

Quick Start & Requirements

  • Install: pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
  • Prerequisites: PyTorch, Whisper (large-v2.pt), speaker encoder model (best_model.pth.tar), hubert-soft model (hubert-soft-0d54a1f4.pt), crepe full model (full.pth), and a pre-trained VITS model (sovits5.0.pretrain.pth).
  • VRAM: Minimum 6GB for training.
  • Data Prep: Requires audio separation, slicing (under 30s), loudness adjustment, and specific directory structures.
  • Links: Hugging Face Spaces demo: https://huggingface.co/spaces/maxmax20160403/sovits5.0
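The slicing and loudness steps in Data Prep can be sketched in plain NumPy. The sample rate, target peak, and function name below are assumptions for illustration; the repository ships its own preprocessing scripts.

```python
import numpy as np

SR = 32000          # assumed sample rate for training audio (illustrative)
MAX_SEC = 30        # README guidance: keep slices under 30 s
TARGET_PEAK = 0.9   # simple peak-based loudness target (illustrative)

def slice_and_normalize(wav, sr=SR, max_sec=MAX_SEC, target_peak=TARGET_PEAK):
    """Split a mono waveform into <= max_sec chunks and peak-normalize each."""
    chunk_len = max_sec * sr
    chunks = []
    for start in range(0, len(wav), chunk_len):
        chunk = wav[start:start + chunk_len].astype(np.float32)
        peak = np.abs(chunk).max()
        if peak > 0:
            chunk = chunk * (target_peak / peak)  # loudness adjustment
        chunks.append(chunk)
    return chunks
```

Real preprocessing would also separate vocals from accompaniment and write chunks into the speaker-specific directory structure the project expects.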

Highlighted Details

  • Supports multiple speakers and speaker mixing for unique voice creation.
  • Offers branches for improved audio quality (bigvgan-mix-v2) and faster inference (RoFormer-HiFTNet).
  • Allows manual F0 editing via Excel for fine-tuning pitch.
  • Integrates feature retrieval for enhanced timbre stability during inference.
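Speaker mixing can be thought of as a weighted blend of speaker-embedding vectors. This toy function (its name and shapes are illustrative, not the project's API) shows the idea:

```python
import numpy as np

def mix_speakers(embeddings, weights):
    """Blend speaker timbre embeddings with normalized weights (illustrative)."""
    w = np.asarray(weights, dtype=np.float32)
    w = w / w.sum()                       # normalize so weights sum to 1
    E = np.stack(embeddings).astype(np.float32)
    return (w[:, None] * E).sum(axis=0)   # weighted average embedding
```

Feeding the blended vector to the decoder in place of a single speaker's embedding yields a timbre "between" the source speakers.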

Maintenance & Community

The project references several influential research papers and codebases, indicating a strong foundation in the field. Specific community links (Discord/Slack) or active contributor information are not detailed in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project does not support real-time voice conversion. Training can be time-consuming because of the data-perturbation techniques. For optimal results, manual F0 adjustment is required during inference.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 41 stars in the last 90 days

Explore Similar Projects

StyleTTS2 by yl4579
Text-to-speech model achieving human-level synthesis
Top 0.2% · 6k stars · created 2 years ago · updated 11 months ago
Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.