NeuralSVB by MoonInTheRiver

Singing voice beautifier research paper implementation

created 3 years ago
440 stars

Top 69.0% on sourcepulse

View on GitHub
Project Summary

NeuralSVB is a PyTorch implementation for enhancing singing voice quality, targeting researchers and developers in speech synthesis and audio processing. It aims to "beautify" singing by improving timbre, pitch, and expressiveness, based on the ACL 2022 paper "Learning the Beauty in Songs."

How It Works

NeuralSVB employs a variational autoencoder (VAE) trained with a global maximum-likelihood (global_mle) objective for singing voice synthesis. It leverages pre-trained components: a HifiGAN-Singing vocoder specialized for singing voices through a Neural Source-Filter (NSF) mechanism, and a Phoneme Posteriorgram (PPG) extractor. This design enables disentangled control over vocal timbre and expressive features, producing a more natural and aesthetically pleasing singing output.
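To make the VAE objective concrete, here is a minimal, self-contained sketch of the two terms such a model optimizes (a reconstruction term plus a KL term, with the reparameterization trick). This is an illustration only, not the NeuralSVB code: the toy `encode`/`decode` functions stand in for the real networks, which operate on mel-spectrograms conditioned on PPG and pitch features.

```python
# Toy VAE objective (illustrative; NOT the actual NeuralSVB implementation).
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    """Placeholder encoder: map each frame to a latent mean and log-variance."""
    mu = 0.5 * x
    logvar = np.full_like(x, -1.0)
    return mu, logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps, so gradients can flow through mu/logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Placeholder decoder: reconstruct the frame from the latent."""
    return 2.0 * z

def vae_loss(x):
    mu, logvar = encode(x)
    z = reparameterize(mu, logvar)
    x_hat = decode(z)
    recon = np.mean((x - x_hat) ** 2)                          # reconstruction term
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))   # KL to N(0, I)
    return recon + kl, recon, kl

x = rng.standard_normal((4, 80))  # 4 frames of an 80-bin "mel-spectrogram"
total, recon, kl = vae_loss(x)
```

The KL term regularizes the latent space toward a standard Gaussian; in a model like NeuralSVB, that shared latent space is what makes mapping between amateur and professional vocal renditions tractable.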

Quick Start & Requirements

  • Install: pip install -r requirements.txt (from the repository root).
  • Prerequisites: PyTorch, CUDA (for GPU acceleration), and the Python packages listed in requirements.txt. Pre-trained checkpoints for the HifiGAN-Singing vocoder and the PPG extractor must be downloaded and placed in the checkpoints directory.
  • Data: Requires the PopBuTFy dataset; access is granted after registering via email, and the data must then be downloaded. Data binarization scripts are provided.
  • Setup Time: Data preparation and model setup may take several hours depending on dataset size and download speeds.
  • Links: Demo Page, Paper

Highlighted Details

  • Official PyTorch implementation of an ACL 2022 paper.
  • Utilizes a VAE with a global_mle objective for singing voice synthesis.
  • Includes a specialized HifiGAN-Singing vocoder and PPG extractor.
  • Trained on 100+ hours of singing data (Chinese and English).

Maintenance & Community

The project is associated with the NATSpeech framework. Issues can be raised on GitHub, with a note that solutions are not guaranteed.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. However, its foundation on DiffSinger and relation to NATSpeech suggests potential Apache 2.0 or similar permissive licenses, but this requires verification. Compatibility for commercial use is not specified.

Limitations & Caveats

Inference from raw audio inputs is marked as "WIP" (Work In Progress). The README directs users to Appendix D of the paper for detailed limitations and solutions. The project's reliance on specific pre-trained models and a custom dataset may present integration challenges.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 90 days
