TCSinger by AaronZ345

PyTorch implementation for zero-shot singing voice synthesis research

created 10 months ago
349 stars

Top 80.9% on sourcepulse

View on GitHub
Project Summary

TCSinger is a PyTorch implementation for zero-shot singing voice synthesis (SVS) with advanced style transfer and multi-level control capabilities. It targets researchers and developers in audio synthesis and AI music generation, offering personalized and controllable SVS by enabling style transfer across different languages and singing styles.

How It Works

TCSinger uses a clustering style encoder to extract stylistic features and a Style and Duration Language Model (S&D-LM) to predict style information and phoneme durations together. Its core innovation is a style-adaptive decoder that applies mel-style adaptive normalization to generate intricate song details. This architecture addresses challenges in style modeling, transfer, and control, improving synthesis quality and controllability.
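
The README does not spell out the normalization mechanics, but mel-style adaptive normalization is conceptually close to AdaIN-style conditioning: a style embedding predicts a per-channel scale and shift applied to normalized decoder features. The PyTorch sketch below illustrates that idea only; the class name, dimensions, and layer choices are assumptions, not the repository's actual implementation.

    import torch
    import torch.nn as nn

    class MelStyleAdaptiveNorm(nn.Module):
        """Illustrative sketch: layer-normalize decoder features, then rescale
        and shift them with gains predicted from a style embedding.
        Names and dimensions are assumptions, not TCSinger's actual code."""

        def __init__(self, hidden_dim: int, style_dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
            # Style embedding -> per-channel scale (gamma) and shift (beta)
            self.affine = nn.Linear(style_dim, 2 * hidden_dim)

        def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
            # x: [batch, frames, hidden_dim] decoder features
            # style: [batch, style_dim] embedding from the clustering style encoder
            gamma, beta = self.affine(style).chunk(2, dim=-1)
            return self.norm(x) * (1.0 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

    # Usage: condition decoder features on the extracted style vector.
    san = MelStyleAdaptiveNorm(hidden_dim=256, style_dim=128)
    feats = torch.randn(2, 400, 256)   # 400 mel frames
    style = torch.randn(2, 128)        # style embedding from a prompt
    print(san(feats, style).shape)     # torch.Size([2, 400, 256])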

Quick Start & Requirements

  • Installation: Clone the repository and set up a conda environment:
    conda create -n tcsinger python=3.10
    conda activate tcsinger   # activate first so packages install into the new environment
    conda install --yes --file requirements.txt
    
  • Prerequisites: NVIDIA GPU with CUDA and cuDNN. Pre-trained models for TCSinger, SAD, SDLM, and HiFi-GAN are available on HuggingFace or Google Drive.
  • Inference: Requires 48 kHz prompt audio, target phonemes, notes, and durations. Inference scripts are provided in inference/style_transfer.py and inference/style_control.py; a prompt-audio preparation sketch follows this list.
  • Training: Requires a singing dataset (e.g., GTSinger) with per-item metadata and phone set configurations. Training commands are provided for the main model, SAD, and SDLM; an illustrative metadata entry also follows this list.
  • Links: A demo page is implied by the project description, but no direct URL is given here.
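
The inference bullet above calls for 48 kHz prompt audio. A minimal preparation sketch, assuming torchaudio is available (it is not confirmed as a project dependency), could resample an arbitrary prompt recording before handing it to the inference scripts:

    import torch
    import torchaudio

    def load_prompt_48k(path: str) -> torch.Tensor:
        # Load the prompt recording and resample to the 48 kHz rate the
        # inference scripts expect; torchaudio here is an assumption, and
        # librosa or sox would work just as well.
        wav, sr = torchaudio.load(path)
        if sr != 48000:
            wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=48000)
        return wav

    prompt = load_prompt_48k("prompt.wav")  # pass alongside target phonemes, notes, durations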
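
The training bullet mentions dataset-specific metadata and phone set configurations. The exact schema is defined by the repository and GTSinger; the snippet below is only a hypothetical illustration of the kind of per-item fields involved, and every field name is an assumption rather than the actual format:

    import json

    # Hypothetical metadata entry for one clip; the real field names and the
    # phone set file format are defined by TCSinger/GTSinger, not by this sketch.
    item = {
        "item_name": "singer01#song03#0001",
        "wav_fn": "data/singer01/song03/0001.wav",  # 48 kHz singing audio
        "ph": ["sh", "i", "j", "ie"],               # phoneme sequence
        "note": [62, 62, 64, 65],                   # MIDI note per phoneme
        "note_dur": [0.35, 0.35, 0.5, 0.8],         # note durations in seconds
        "singer": "singer01",
    }

    with open("metadata.json", "w", encoding="utf-8") as f:
        json.dump([item], f, ensure_ascii=False, indent=2)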

Highlighted Details

  • Zero-shot style transfer across cross-lingual speech and singing styles.
  • Multi-level style control for personalized SVS.
  • Style-adaptive decoder with mel-style adaptive normalization.
  • Outperforms baselines in synthesis quality, singer similarity, and style controllability.

Maintenance & Community

The project is associated with Zhejiang University and was accepted at EMNLP 2024. Checkpoints were released in December 2024 and code in November 2024.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. The project's disclaimer prohibits generating singing voices without a person's consent, particularly for public figures, and warns of potential copyright violations.

Limitations & Caveats

The provided pre-trained TCSinger checkpoint supports only Chinese and English; for other languages, users must train their own models on GTSinger. Style control is reported to be suboptimal for certain timbres because speech and unannotated data were included in training.

Health Check

  • Last commit: 4 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 17 stars in the last 90 days
