NeuralSVB by MoonInTheRiver

Singing voice beautifier research paper implementation

created 3 years ago
440 stars

Top 69.0% on sourcepulse

View on GitHub
Project Summary

NeuralSVB is a PyTorch implementation for enhancing singing voice quality, targeting researchers and developers in speech synthesis and audio processing. It aims to "beautify" singing by improving timbre, pitch, and expressiveness, based on the ACL 2022 paper "Learning the Beauty in Songs."

How It Works

NeuralSVB employs a variational autoencoder (VAE) trained with a global maximum-likelihood (global_mle) objective for singing voice synthesis. It leverages pre-trained components: a HifiGAN-Singing vocoder specialized for singing voices through a Neural Source-Filter (NSF) mechanism, and a Phoneme Posteriorgram (PPG) extractor. This design enables disentangled control over vocal timbre and expressive features, producing a more natural and aesthetically pleasing singing output.
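To make the VAE objective concrete, here is a minimal, self-contained sketch of the two terms such a model optimizes (a reconstruction term plus a KL term, with the reparameterization trick). This is an illustration only, not the NeuralSVB code: the toy `encode`/`decode` functions stand in for the real networks, which operate on mel-spectrograms conditioned on PPG and pitch features.

```python
# Toy VAE objective (illustrative; NOT the actual NeuralSVB implementation).
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    """Placeholder encoder: map each frame to a latent mean and log-variance."""
    mu = 0.5 * x
    logvar = np.full_like(x, -1.0)
    return mu, logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps, so gradients can flow through mu/logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Placeholder decoder: reconstruct the frame from the latent."""
    return 2.0 * z

def vae_loss(x):
    mu, logvar = encode(x)
    z = reparameterize(mu, logvar)
    x_hat = decode(z)
    recon = np.mean((x - x_hat) ** 2)                          # reconstruction term
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))   # KL to N(0, I)
    return recon + kl, recon, kl

x = rng.standard_normal((4, 80))  # 4 frames of an 80-bin "mel-spectrogram"
total, recon, kl = vae_loss(x)
```

The KL term regularizes the latent space toward a standard Gaussian; in a model like NeuralSVB, that shared latent space is what makes mapping between amateur and professional vocal renditions tractable.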

Quick Start & Requirements

  • Install: pip install -r requirements.txt (from the repository root).
  • Prerequisites: PyTorch, CUDA (for GPU acceleration), and the Python packages listed in requirements.txt. Pre-trained checkpoints for the HifiGAN-Singing vocoder and the PPG extractor must be downloaded and placed in the checkpoints directory.
  • Data: Requires the PopBuTFy dataset; access is granted after registering via email, and the data must then be downloaded. Data binarization scripts are provided.
  • Setup Time: Data preparation and model setup may take several hours depending on dataset size and download speeds.
  • Links: Demo Page, Paper

Highlighted Details

  • Official PyTorch implementation of an ACL 2022 paper.
  • Utilizes a VAE with a global_mle objective for singing voice synthesis.
  • Includes a specialized HifiGAN-Singing vocoder and PPG extractor.
  • Trained on 100+ hours of singing data (Chinese and English).

Maintenance & Community

The project is associated with the NATSpeech framework. Issues can be raised on GitHub, with a note that solutions are not guaranteed.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. However, its foundation on DiffSinger and relation to NATSpeech suggests potential Apache 2.0 or similar permissive licenses, but this requires verification. Compatibility for commercial use is not specified.

Limitations & Caveats

Inference from raw audio inputs is marked as "WIP" (Work In Progress). The README directs users to Appendix D of the paper for detailed limitations and solutions. The project's reliance on specific pre-trained models and a custom dataset may present integration challenges.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 90 days
