TCSinger by AaronZ345

PyTorch implementation for zero-shot singing voice synthesis research

created 10 months ago
349 stars

Top 80.9% on sourcepulse

View on GitHub
Project Summary

TCSinger is a PyTorch implementation for zero-shot singing voice synthesis (SVS) with advanced style transfer and multi-level control capabilities. It targets researchers and developers in audio synthesis and AI music generation, offering personalized and controllable SVS by enabling style transfer across different languages and singing styles.

How It Works

TCSinger uses a clustering style encoder to extract stylistic features and a Style and Duration Language Model (S&D-LM) to predict style information and phoneme durations together. Its core innovation is a style-adaptive decoder that applies mel-style adaptive normalization to generate intricate song details. This architecture addresses challenges in style modeling, transfer, and control, improving synthesis quality and controllability.
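
The README does not spell out the normalization mechanics, but mel-style adaptive normalization is conceptually close to AdaIN-style conditioning: a style embedding predicts a per-channel scale and shift applied to normalized decoder features. The PyTorch sketch below illustrates that idea only; the class name, dimensions, and layer choices are assumptions, not the repository's actual implementation.

    import torch
    import torch.nn as nn

    class MelStyleAdaptiveNorm(nn.Module):
        """Illustrative sketch: layer-normalize decoder features, then rescale
        and shift them with gains predicted from a style embedding.
        Names and dimensions are assumptions, not TCSinger's actual code."""

        def __init__(self, hidden_dim: int, style_dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
            # Style embedding -> per-channel scale (gamma) and shift (beta)
            self.affine = nn.Linear(style_dim, 2 * hidden_dim)

        def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
            # x: [batch, frames, hidden_dim] decoder features
            # style: [batch, style_dim] embedding from the clustering style encoder
            gamma, beta = self.affine(style).chunk(2, dim=-1)
            return self.norm(x) * (1.0 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

    # Usage: condition decoder features on the extracted style vector.
    san = MelStyleAdaptiveNorm(hidden_dim=256, style_dim=128)
    feats = torch.randn(2, 400, 256)   # 400 mel frames
    style = torch.randn(2, 128)        # style embedding from a prompt
    print(san(feats, style).shape)     # torch.Size([2, 400, 256])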

Quick Start & Requirements

  • Installation: Clone the repository and set up a conda environment:
    conda create -n tcsinger python=3.10
    conda activate tcsinger   # activate first so packages install into the new environment
    conda install --yes --file requirements.txt
    
  • Prerequisites: NVIDIA GPU with CUDA and cuDNN. Pre-trained models for TCSinger, SAD, SDLM, and HiFi-GAN are available on HuggingFace or Google Drive.
  • Inference: Requires 48 kHz prompt audio, target phonemes, notes, and durations. Inference scripts are provided in inference/style_transfer.py and inference/style_control.py; a prompt-audio preparation sketch follows this list.
  • Training: Requires a singing dataset (e.g., GTSinger) with per-item metadata and phone set configurations. Training commands are provided for the main model, SAD, and SDLM; an illustrative metadata entry also follows this list.
  • Links: A demo page is implied by the project description, but no direct URL is given here.
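
The inference bullet above calls for 48 kHz prompt audio. A minimal preparation sketch, assuming torchaudio is available (it is not confirmed as a project dependency), could resample an arbitrary prompt recording before handing it to the inference scripts:

    import torch
    import torchaudio

    def load_prompt_48k(path: str) -> torch.Tensor:
        # Load the prompt recording and resample to the 48 kHz rate the
        # inference scripts expect; torchaudio here is an assumption, and
        # librosa or sox would work just as well.
        wav, sr = torchaudio.load(path)
        if sr != 48000:
            wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=48000)
        return wav

    prompt = load_prompt_48k("prompt.wav")  # pass alongside target phonemes, notes, durations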
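
The training bullet mentions dataset-specific metadata and phone set configurations. The exact schema is defined by the repository and GTSinger; the snippet below is only a hypothetical illustration of the kind of per-item fields involved, and every field name is an assumption rather than the actual format:

    import json

    # Hypothetical metadata entry for one clip; the real field names and the
    # phone set file format are defined by TCSinger/GTSinger, not by this sketch.
    item = {
        "item_name": "singer01#song03#0001",
        "wav_fn": "data/singer01/song03/0001.wav",  # 48 kHz singing audio
        "ph": ["sh", "i", "j", "ie"],               # phoneme sequence
        "note": [62, 62, 64, 65],                   # MIDI note per phoneme
        "note_dur": [0.35, 0.35, 0.5, 0.8],         # note durations in seconds
        "singer": "singer01",
    }

    with open("metadata.json", "w", encoding="utf-8") as f:
        json.dump([item], f, ensure_ascii=False, indent=2)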

Highlighted Details

  • Zero-shot style transfer across cross-lingual speech and singing styles.
  • Multi-level style control for personalized SVS.
  • Style-adaptive decoder with mel-style adaptive normalization.
  • Outperforms baselines in synthesis quality, singer similarity, and style controllability.

Maintenance & Community

The project is associated with Zhejiang University and was accepted at EMNLP 2024. Checkpoints were released in December 2024 and code in November 2024.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. The project's disclaimer prohibits generating singing voices without a person's consent, particularly for public figures, and warns of potential copyright violations.

Limitations & Caveats

The provided pre-trained TCSinger checkpoint supports only Chinese and English; for other languages, users must train their own models on GTSinger. Style control is reported to be suboptimal for certain timbres because speech and unannotated data were included in training.

Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 17 stars in the last 90 days
