lora-svc  by PlayVoice

Singing voice conversion tool using Whisper & BigVGAN

created 2 years ago
640 stars

Top 52.9% on sourcepulse

Project Summary

This project provides a singing voice conversion (SVC) system that leverages OpenAI's Whisper for content encoding and NVIDIA's BigVGAN for neural source-filter synthesis. It targets researchers and hobbyists interested in AI-powered voice manipulation and singing synthesis, enabling users to clone singing voices with a high degree of control.

How It Works

The system first separates the vocals from the accompaniment, then cuts the audio into short segments from which Whisper extracts content embeddings (PPG). In parallel, it extracts pitch (F0) and speaker timbre information. These features are fed into a BigVGAN-based generator, conditioned on the target speaker's timbre, to synthesize the converted singing voice. Decoupling content, pitch, and timbre in this way is intended to give high-fidelity conversion.
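
The decoupling described above can be illustrated with a minimal Python sketch. It is not the project's code: `SegmentFeatures`, `split_segments`, `transpose`, and `swap_speaker` are hypothetical names, and the 30-second default mirrors Whisper's input window rather than a value confirmed for this repository. The point is only that once content, pitch, and timbre are stored separately, each can be changed without touching the others.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class SegmentFeatures:
    """Hypothetical container for the three decoupled feature streams."""
    content: List[List[float]]  # per-frame content embedding (PPG)
    f0_hz: List[float]          # per-frame pitch in Hz; 0.0 marks unvoiced frames
    timbre: List[float]         # one speaker embedding for the whole segment


def split_segments(samples: Sequence[float], sample_rate: int,
                   max_seconds: float = 30.0) -> List[Sequence[float]]:
    """Cut a signal into consecutive chunks of at most max_seconds each."""
    step = int(sample_rate * max_seconds)
    if step <= 0:
        raise ValueError("sample_rate * max_seconds must be at least 1 sample")
    return [samples[i:i + step] for i in range(0, len(samples), step)]


def transpose(features: SegmentFeatures, semitones: float) -> SegmentFeatures:
    """Shift pitch by a number of semitones; content and timbre are untouched."""
    ratio = 2.0 ** (semitones / 12.0)
    shifted = [f * ratio if f > 0 else 0.0 for f in features.f0_hz]
    return replace(features, f0_hz=shifted)


def swap_speaker(features: SegmentFeatures,
                 target_timbre: Sequence[float]) -> SegmentFeatures:
    """Replace the speaker embedding; content and pitch are untouched."""
    return replace(features, timbre=list(target_timbre))
```

In a real pipeline the resulting features would be passed to the generator; here they are plain lists so the sketch runs without any model files.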

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites:
    • Python 3.x
    • Download Whisper medium model (medium.pt)
    • Download Timbre Encoder (best_model.pth.tar)
    • Download BigVGAN pre-trained model (maxgan_pretrain_32K.pth)
  • Setup: Requires downloading multiple pre-trained models and preparing datasets with specific directory structures. Data preprocessing involves several Python scripts for resampling, pitch extraction, PPG extraction, and timbre code extraction.
  • Links: Demo Video
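
Because setup depends on several manually downloaded files, a small pre-flight check can save a failed run. The sketch below uses the three file names from the prerequisites list; the single flat directory and the helper name `missing_models` are assumptions, since the summary does not say where each file is expected to live.

```python
from pathlib import Path
from typing import Iterable, List

# File names come from the prerequisites list above.
# Storing them together in one directory is an assumption for illustration.
REQUIRED_MODELS = (
    "medium.pt",                # Whisper medium model
    "best_model.pth.tar",       # timbre encoder
    "maxgan_pretrain_32K.pth",  # BigVGAN pre-trained model
)


def missing_models(model_dir: str,
                   required: Iterable[str] = REQUIRED_MODELS) -> List[str]:
    """Return the required file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

Calling `missing_models("pretrained")` (a hypothetical directory name) returns an empty list when all three files are in place.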

Highlighted Details

  • Builds on work from three major AI organizations: OpenAI's Whisper, NVIDIA's BigVGAN, and Microsoft's adapter technique (LoRA).
  • Supports Whisper's multilingual models.
  • Offers both command-line inference and a GUI (svc_gui.py).
  • Includes steps for exporting inference models and post-processing with VAD.
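
The summary does not say which VAD method the post-processing step uses. As a rough illustration of the idea, the sketch below applies a simple energy-based gate: frames whose RMS level falls below a threshold are treated as silence and muted. The function names, frame length, and threshold are all assumptions, and this simple gate stands in for whatever detector the project actually uses.

```python
import math
from typing import List, Sequence


def voiced_flags(samples: Sequence[float], frame_len: int = 320,
                 threshold: float = 0.01) -> List[bool]:
    """Mark each frame True when its RMS energy reaches the threshold."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    flags = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        flags.append(rms >= threshold)
    return flags


def gate_silence(samples: Sequence[float], frame_len: int = 320,
                 threshold: float = 0.01) -> List[float]:
    """Zero out frames judged silent, leaving active frames unchanged."""
    out: List[float] = []
    for idx, keep in enumerate(voiced_flags(samples, frame_len, threshold)):
        frame = samples[idx * frame_len:(idx + 1) * frame_len]
        out.extend(frame if keep else [0.0] * len(frame))
    return out
```

Muting low-energy frames in the converted output is one way to suppress residual noise in passages where the singer is silent.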

Maintenance & Community

  • The project references several research papers and GitHub repositories, indicating a foundation in established AI techniques.
  • No explicit community links (Discord, Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The project incorporates code from various sources, each with its own license. Users should verify compatibility for commercial use.

Limitations & Caveats

  • LoRA implementation is noted as not fully integrated within this specific repository.
  • The setup process is complex, requiring manual downloading of multiple large pre-trained models and careful data preparation.
  • Performance and quality are highly dependent on the quality of the input audio and the chosen pre-trained models.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 90 days
