StyleSinger: PyTorch implementation of zero-shot style transfer for out-of-domain singing voice synthesis
StyleSinger is a PyTorch implementation of zero-shot style transfer for singing voice synthesis, targeting researchers and developers in AI music generation. It synthesizes singing voices in unseen styles by adapting to a reference audio sample, and reports higher audio quality and style similarity than baseline models.
How It Works
StyleSinger employs a Residual Style Adaptor (RSA) that utilizes a residual quantization model to precisely capture diverse style characteristics from reference singing voice samples. To enhance generalization, it introduces Uncertainty Modeling Layer Normalization (UMLN), which perturbs style information within content representations during training.
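A minimal PyTorch sketch of the two ideas follows; the class names, tensor shapes, and noise scheme are simplifying assumptions for illustration, not the repository's actual modules.

import torch
import torch.nn as nn

class ResidualQuantizer(nn.Module):
    # Sketch of the residual quantization idea behind the RSA: each
    # codebook quantizes the residual left by the previous ones, so the
    # reference style embedding is captured coarse-to-fine.
    # Straight-through gradients and commitment losses are omitted.
    def __init__(self, dim, codebook_size=256, n_quantizers=4):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(n_quantizers))

    def forward(self, style):
        # style: (batch, dim) embedding of the reference singing sample.
        residual = style
        quantized = torch.zeros_like(style)
        for codebook in self.codebooks:
            dists = torch.cdist(residual, codebook.weight)  # (batch, K)
            chosen = codebook(dists.argmin(-1))             # nearest code
            quantized = quantized + chosen
            residual = residual - chosen
        return quantized

class UncertaintyModelingLayerNorm(nn.Module):
    # Sketch of UMLN: a layer norm whose style-conditioned scale and bias
    # are perturbed with Gaussian noise during training, so the model
    # cannot overfit a fixed style-to-statistics mapping.
    def __init__(self, hidden_dim, style_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.to_scale = nn.Linear(style_dim, hidden_dim)
        self.to_bias = nn.Linear(style_dim, hidden_dim)

    def forward(self, x, style):
        # x: (batch, time, hidden_dim); style: (batch, style_dim)
        normed = (x - x.mean(-1, keepdim=True)) / torch.sqrt(
            x.var(-1, keepdim=True, unbiased=False) + self.eps)
        scale = self.to_scale(style).unsqueeze(1)  # (batch, 1, hidden)
        bias = self.to_bias(style).unsqueeze(1)
        if self.training:
            # Resample the style statistics from a Gaussian whose spread
            # is estimated across the batch (the "uncertainty").
            scale = scale + torch.randn_like(scale) * scale.std(
                0, keepdim=True, unbiased=False)
            bias = bias + torch.randn_like(bias) * bias.std(
                0, keepdim=True, unbiased=False)
        return normed * scale + bias

At inference time self.training is False, so the perturbation is disabled and the style mapping is deterministic.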
Quick Start & Requirements
Install the dependencies from requirements.txt (e.g. pip install -r requirements.txt). Download the pre-trained checkpoints into the checkpoints/ directory, then run inference with CUDA_VISIBLE_DEVICES=$GPU python inference/StyleSinger.py --config egs/stylesinger.yaml. To train your own model, first binarize the dataset with CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config egs/stylesinger.yaml, followed by training with CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml --exp_name StyleSinger.
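All three commands read the same experiment config; a minimal sketch for inspecting it before a run, assuming egs/stylesinger.yaml is standard YAML (the keys queried below are hypothetical):

import yaml  # requires pyyaml

# Load the experiment config shared by binarization, training, and inference.
with open("egs/stylesinger.yaml") as f:
    cfg = yaml.safe_load(f)

# Hypothetical key names; print whatever the file actually defines.
for key in ("task_cls", "max_updates", "lr"):
    print(key, "->", cfg.get(key))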
Highlighted Details
Maintenance & Community
The project is associated with Zhejiang University and Huawei Cloud, and its accompanying paper was published at AAAI 2024.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. However, a disclaimer prohibits using the technology to generate a person's singing voice without consent, particularly for public figures, which may imply usage restrictions.
Limitations & Caveats
The provided pre-trained checkpoint supports only Chinese singing voices; multilingual style transfer requires training custom models on datasets such as GTSinger. The disclaimer also warns against unauthorized generation of singing voices, which may constrain commercial use cases.