TCSinger by AaronZ345

PyTorch for zero-shot singing voice synthesis research

Created 1 year ago
361 stars

Top 77.5% on SourcePulse

View on GitHub
Project Summary

TCSinger is a PyTorch implementation for zero-shot singing voice synthesis (SVS) with advanced style transfer and multi-level control capabilities. It targets researchers and developers in audio synthesis and AI music generation, offering personalized and controllable SVS by enabling style transfer across different languages and singing styles.

How It Works

TCSinger uses a clustering style encoder to extract stylistic features and a Style and Duration Language Model (S&D-LM) to predict style information and phoneme durations. The core innovation is its style-adaptive decoder, which applies a mel-style adaptive normalization method to generate intricate song details. This architecture addresses challenges in style modeling, transfer, and control, improving both synthesis quality and controllability.
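The idea behind style-adaptive normalization can be sketched as an AdaIN-like operation: normalize each mel channel over time, then rescale and shift it with parameters derived from the style embedding. The sketch below is a hypothetical illustration of that concept in NumPy, not the repository's actual decoder layer; `style_gamma` and `style_beta` stand in for the style-predicted modulation parameters.

```python
import numpy as np

def style_adaptive_norm(mel, style_gamma, style_beta, eps=1e-5):
    """Illustrative mel-style adaptive normalization (AdaIN-like sketch,
    NOT the repo's exact implementation): normalize each mel channel
    over time, then scale/shift with style-derived parameters."""
    mean = mel.mean(axis=-1, keepdims=True)   # per-channel mean over frames
    std = mel.std(axis=-1, keepdims=True)     # per-channel std over frames
    normed = (mel - mean) / (std + eps)
    return style_gamma[:, None] * normed + style_beta[:, None]

# Toy example: 80 mel bins x 100 frames, one (gamma, beta) pair per bin.
mel = np.random.randn(80, 100)
gamma = np.full(80, 0.5)
beta = np.zeros(80)
out = style_adaptive_norm(mel, gamma, beta)
```

After modulation, each channel's statistics follow the style parameters rather than the content's original statistics, which is what lets a style prompt reshape the decoder's output.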

Quick Start & Requirements

  • Installation: Clone the repository, then create and activate a conda environment before installing dependencies:
    conda create -n tcsinger python=3.10
    conda activate tcsinger
    conda install --yes --file requirements.txt
    
  • Prerequisites: NVIDIA GPU with CUDA and cuDNN. Pre-trained checkpoints for TCSinger, SAD, SDLM, and HiFi-GAN are available on HuggingFace or Google Drive.
  • Inference: Requires 48 kHz prompt audio, plus target phonemes, notes, and durations. Inference scripts are provided in inference/style_transfer.py and inference/style_control.py.
  • Training: Requires a singing dataset (e.g., GTSinger) with specific metadata and phone set configurations. Training commands are provided for the main model, SAD, and SDLM.
  • Links: Demo Page
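Since inference expects 48 kHz prompt audio, recordings at other sample rates need resampling first. The helper below is a minimal, self-contained linear-interpolation sketch (a hypothetical utility, not part of the repo); in practice a proper resampler such as librosa.resample or torchaudio's transforms is preferable for audio quality.

```python
import numpy as np

SR_TARGET = 48_000  # TCSinger inference expects 48 kHz prompt audio

def resample_linear(wav, sr_in, sr_out=SR_TARGET):
    """Minimal linear-interpolation resampler (illustrative sketch only;
    prefer librosa.resample or torchaudio for real prompt audio)."""
    if sr_in == sr_out:
        return wav
    n_out = int(round(len(wav) / sr_in * sr_out))
    t_in = np.arange(len(wav)) / sr_in    # original sample times (s)
    t_out = np.arange(n_out) / sr_out     # target sample times (s)
    return np.interp(t_out, t_in, wav)

# Toy example: 1 second of 440 Hz at 44.1 kHz, resampled to 48 kHz.
wav_44k = np.sin(2 * np.pi * 440 * np.arange(44_100) / 44_100)
wav_48k = resample_linear(wav_44k, 44_100)
```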

Highlighted Details

  • Zero-shot style transfer across cross-lingual speech and singing styles.
  • Multi-level style control for personalized SVS.
  • Style-adaptive decoder with mel-style adaptive normalization.
  • Outperforms baselines in synthesis quality, singer similarity, and style controllability.

Maintenance & Community

The project is associated with Zhejiang University and accepted by EMNLP 2024. Checkpoints were released in December 2024, and code in November 2024.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. The disclaimer prohibits generating singing without consent, particularly for public figures, and warns of potential copyright violations.

Limitations & Caveats

The provided pre-trained TCSinger checkpoint only supports Chinese and English. For multilingual capabilities, users must train their own models based on GTSinger. The effectiveness of the style control feature is noted as suboptimal for certain timbres due to the inclusion of speech and unannotated data.

Health Check

Last Commit: 4 weeks ago
Responsiveness: 1 day
Pull Requests (30d): 1
Issues (30d): 0
Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Junyang Lin (Core Maintainer at Alibaba Qwen), and 6 more.

OpenVoice by myshell-ai

0.4% · 35k stars
Audio foundation model for versatile, instant voice cloning
Created 1 year ago · Updated 6 months ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

0.3% · 52k stars
Few-shot voice cloning and TTS web UI
Created 1 year ago · Updated 1 month ago