StyleSinger by AaronZ345

PyTorch implementation of zero-shot style transfer for out-of-domain singing voice synthesis

created 1 year ago
408 stars

Top 72.5% on sourcepulse

Project Summary

StyleSinger is a PyTorch implementation of zero-shot style transfer for singing voice synthesis, aimed at researchers and developers in AI music generation. It synthesizes singing voices in unseen styles by adapting to reference audio samples, and reports higher audio quality and style similarity than baseline models.

How It Works

StyleSinger employs a Residual Style Adaptor (RSA) that utilizes a residual quantization model to precisely capture diverse style characteristics from reference singing voice samples. To enhance generalization, it introduces Uncertainty Modeling Layer Normalization (UMLN), which perturbs style information within content representations during training.
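To make UMLN concrete, below is a minimal PyTorch sketch under stated assumptions: the style embedding is projected to a per-channel scale and bias for a conditional layer norm, and during training those style statistics are perturbed with Gaussian noise scaled by their batch-level standard deviation. All names are hypothetical; this illustrates the idea rather than reproducing the repository's implementation.

```python
import torch
import torch.nn as nn


class UMLN(nn.Module):
    """Sketch of Uncertainty Modeling Layer Normalization (hypothetical)."""

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        # Plain layer norm; the affine part comes from the style embedding.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(style_dim, hidden_dim)
        self.to_bias = nn.Linear(style_dim, hidden_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); style: (batch, style_dim)
        scale = self.to_scale(style)  # per-channel gain, (batch, hidden_dim)
        bias = self.to_bias(style)    # per-channel shift
        if self.training:
            # Perturb the style statistics so the model cannot memorize the
            # exact styles seen in training -- the source of the claimed
            # generalization to unseen styles.
            scale = scale + torch.randn_like(scale) * scale.std(dim=0, keepdim=True, unbiased=False)
            bias = bias + torch.randn_like(bias) * bias.std(dim=0, keepdim=True, unbiased=False)
        return self.norm(x) * scale.unsqueeze(1) + bias.unsqueeze(1)


# Toy usage: style a batch of content frames with reference embeddings.
umln = UMLN(hidden_dim=256, style_dim=128)
out = umln(torch.randn(4, 100, 256), torch.randn(4, 128))  # (4, 100, 256)
```

Because the perturbation is gated on self.training, inference stays deterministic; the noise only discourages over-fitting to the style statistics of the training set.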

Quick Start & Requirements

  • Installation: Requires a conda environment with Python 3.8 and dependencies listed in requirements.txt.
  • Prerequisites: NVIDIA GPU with CUDA and cuDNN.
  • Models: Pre-trained checkpoints (acoustic model, HiFi-GAN vocoder, and emotion encoder) are available on HuggingFace or Google Drive.
  • Data: The provided checkpoint supports Chinese singing voices only. For multilingual support, users must train their own models on the GTSinger dataset.
  • Inference: Download the checkpoints, place them in the checkpoints/ directory, and run CUDA_VISIBLE_DEVICES=$GPU python inference/StyleSinger.py --config egs/stylesinger.yaml.
  • Training: Preprocess the data with CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config egs/stylesinger.yaml, then train with CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml --exp_name StyleSinger. A consolidated command sequence follows this list.
  • Resources: Official demo page for audio samples.
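The documented commands, collected into one session for convenience. The two conda lines are a standard setup pattern inferred from the README's environment requirement rather than verbatim commands; set $GPU to your device id.

```bash
# Environment: conda with Python 3.8 plus the listed dependencies.
conda create -n stylesinger python=3.8
conda activate stylesinger
pip install -r requirements.txt

# Inference: pre-trained checkpoints must already sit under checkpoints/.
CUDA_VISIBLE_DEVICES=$GPU python inference/StyleSinger.py --config egs/stylesinger.yaml

# Training: binarize the data, then launch the training task.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config egs/stylesinger.yaml
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml --exp_name StyleSinger
```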

Highlighted Details

  • First singing voice synthesis model for zero-shot style transfer of out-of-domain singing voices.
  • Features a Residual Style Adaptor (RSA) for detailed capture of style characteristics (a toy sketch follows this list).
  • Incorporates Uncertainty Modeling Layer Normalization (UMLN) for improved generalization.
  • Achieves superior audio quality and similarity in extensive zero-shot style transfer experiments.
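As referenced in the RSA bullet above, a toy residual quantizer illustrates the mechanism: each stage snaps the remaining residual to its nearest codeword, so early stages capture coarse style traits and later stages progressively finer detail. Names and sizes are hypothetical, and training machinery (commitment losses, codebook updates) is omitted.

```python
import torch
import torch.nn as nn


class ResidualQuantizer(nn.Module):
    """Toy residual quantization in the spirit of the RSA (hypothetical)."""

    def __init__(self, num_stages: int, codebook_size: int, dim: int):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_stages)
        )

    def forward(self, z: torch.Tensor):
        # z: (batch, time, dim) style features from the reference audio
        residual = z
        quantized = torch.zeros_like(z)
        indices = []
        for codebook in self.codebooks:
            # Squared L2 distance from each residual vector to each codeword.
            dist = (residual.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)
            idx = dist.argmin(dim=-1)     # (batch, time) codeword indices
            code = codebook(idx)          # nearest codewords
            quantized = quantized + code  # running reconstruction
            residual = residual - code    # what is left for later stages
            indices.append(idx)
        return quantized, indices


quantized, codes = ResidualQuantizer(4, 256, 64)(torch.randn(2, 50, 64))
```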

Maintenance & Community

The project is associated with Zhejiang University and Huawei Cloud; the accompanying AAAI 2024 paper is cited in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. However, the disclaimer prohibits using the technology to generate a person's singing voice without consent, particularly for public figures, which may imply usage restrictions.

Limitations & Caveats

The provided pre-trained checkpoint supports Chinese singing voices only; multilingual style transfer requires training custom models on datasets such as GTSinger. The disclaimer also warns against unauthorized generation of singing voices, which may constrain commercial use.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 27 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579
Text-to-speech model achieving human-level synthesis
Top 0.2% · 6k stars · created 2 years ago · updated 11 months ago