StyleSinger by AaronZ345

PyTorch implementation of zero-shot style transfer for out-of-domain singing voice synthesis

created 1 year ago
408 stars

Top 72.5% on sourcepulse

Project Summary

StyleSinger is a PyTorch implementation of zero-shot style transfer for singing voice synthesis, aimed at researchers and developers in AI music generation. It synthesizes singing voices in unseen styles by adapting to reference audio samples, and reports higher audio quality and style similarity than baseline models.

How It Works

StyleSinger employs a Residual Style Adaptor (RSA) that utilizes a residual quantization model to precisely capture diverse style characteristics from reference singing voice samples. To enhance generalization, it introduces Uncertainty Modeling Layer Normalization (UMLN), which perturbs style information within content representations during training.
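To make UMLN concrete, below is a minimal PyTorch sketch under stated assumptions: the style embedding is projected to a per-channel scale and bias for a conditional layer norm, and during training those style statistics are perturbed with Gaussian noise scaled by their batch-level standard deviation. All names are hypothetical; this illustrates the idea rather than reproducing the repository's implementation.

```python
import torch
import torch.nn as nn


class UMLN(nn.Module):
    """Sketch of Uncertainty Modeling Layer Normalization (hypothetical)."""

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        # Plain layer norm; the affine part comes from the style embedding.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(style_dim, hidden_dim)
        self.to_bias = nn.Linear(style_dim, hidden_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); style: (batch, style_dim)
        scale = self.to_scale(style)  # per-channel gain, (batch, hidden_dim)
        bias = self.to_bias(style)    # per-channel shift
        if self.training:
            # Perturb the style statistics so the model cannot memorize the
            # exact styles seen in training -- the source of the claimed
            # generalization to unseen styles.
            scale = scale + torch.randn_like(scale) * scale.std(dim=0, keepdim=True, unbiased=False)
            bias = bias + torch.randn_like(bias) * bias.std(dim=0, keepdim=True, unbiased=False)
        return self.norm(x) * scale.unsqueeze(1) + bias.unsqueeze(1)


# Toy usage: style a batch of content frames with reference embeddings.
umln = UMLN(hidden_dim=256, style_dim=128)
out = umln(torch.randn(4, 100, 256), torch.randn(4, 128))  # (4, 100, 256)
```

Because the perturbation is gated on self.training, inference stays deterministic; the noise only discourages over-fitting to the style statistics of the training set.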

Quick Start & Requirements

  • Installation: Requires a conda environment with Python 3.8 and dependencies listed in requirements.txt.
  • Prerequisites: NVIDIA GPU with CUDA and cuDNN.
  • Models: Pre-trained checkpoints (acoustic model, HiFi-GAN vocoder, and emotion encoder) are available on HuggingFace or Google Drive.
  • Data: The provided checkpoint supports Chinese singing voices only. For multilingual support, users must train their own models on the GTSinger dataset.
  • Inference: Download the checkpoints, place them in the checkpoints/ directory, and run CUDA_VISIBLE_DEVICES=$GPU python inference/StyleSinger.py --config egs/stylesinger.yaml.
  • Training: Preprocess the data with CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config egs/stylesinger.yaml, then train with CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml --exp_name StyleSinger. A consolidated command sequence follows this list.
  • Resources: Official demo page for audio samples.
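The documented commands, collected into one session for convenience. The two conda lines are a standard setup pattern inferred from the README's environment requirement rather than verbatim commands; set $GPU to your device id.

```bash
# Environment: conda with Python 3.8 plus the listed dependencies.
conda create -n stylesinger python=3.8
conda activate stylesinger
pip install -r requirements.txt

# Inference: pre-trained checkpoints must already sit under checkpoints/.
CUDA_VISIBLE_DEVICES=$GPU python inference/StyleSinger.py --config egs/stylesinger.yaml

# Training: binarize the data, then launch the training task.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config egs/stylesinger.yaml
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml --exp_name StyleSinger
```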

Highlighted Details

  • First singing voice synthesis model for zero-shot style transfer of out-of-domain singing voices.
  • Features a Residual Style Adaptor (RSA) for detailed capture of style characteristics (a toy sketch follows this list).
  • Incorporates Uncertainty Modeling Layer Normalization (UMLN) for improved generalization.
  • Achieves superior audio quality and similarity in extensive zero-shot style transfer experiments.
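As referenced in the RSA bullet above, a toy residual quantizer illustrates the mechanism: each stage snaps the remaining residual to its nearest codeword, so early stages capture coarse style traits and later stages progressively finer detail. Names and sizes are hypothetical, and training machinery (commitment losses, codebook updates) is omitted.

```python
import torch
import torch.nn as nn


class ResidualQuantizer(nn.Module):
    """Toy residual quantization in the spirit of the RSA (hypothetical)."""

    def __init__(self, num_stages: int, codebook_size: int, dim: int):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_stages)
        )

    def forward(self, z: torch.Tensor):
        # z: (batch, time, dim) style features from the reference audio
        residual = z
        quantized = torch.zeros_like(z)
        indices = []
        for codebook in self.codebooks:
            # Squared L2 distance from each residual vector to each codeword.
            dist = (residual.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)
            idx = dist.argmin(dim=-1)     # (batch, time) codeword indices
            code = codebook(idx)          # nearest codewords
            quantized = quantized + code  # running reconstruction
            residual = residual - code    # what is left for later stages
            indices.append(idx)
        return quantized, indices


quantized, codes = ResidualQuantizer(4, 256, 64)(torch.randn(2, 50, 64))
```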

Maintenance & Community

The project is associated with Zhejiang University and Huawei Cloud; the accompanying AAAI 2024 paper is cited in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. However, the disclaimer prohibits using the technology to generate a person's singing voice without consent, particularly for public figures, which may imply usage restrictions.

Limitations & Caveats

The provided pre-trained checkpoint supports Chinese singing voices only; multilingual style transfer requires training custom models on datasets such as GTSinger. The disclaimer also warns against unauthorized generation of singing voices, which may constrain commercial use.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 27 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579
Text-to-speech model achieving human-level synthesis
Top 0.2% · 6k stars · created 2 years ago · updated 11 months ago