SongGen by LiuZH-19

Text-to-song generation with an auto-regressive transformer

Created 7 months ago
260 stars

Top 97.7% on SourcePulse

View on GitHub
Project Summary

SongGen is a single-stage auto-regressive Transformer model for text-to-song generation, offering control via lyrics, descriptive text, and optional reference voice. It targets researchers and developers in music generation, providing a baseline for creating coherent and expressive songs from textual prompts.

How It Works

SongGen employs a single-stage auto-regressive Transformer architecture, directly generating audio tokens from text and lyrics. It supports both "Mixed Pro" (single-track) and "Interleaving A-V" (dual-track) modes, allowing for versatile song structures. The model leverages an X-Codec for audio tokenization and offers optional reference voice conditioning for style transfer.
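As an illustration of the two output layouts described above, here is a toy sketch. The token values are hypothetical placeholders (the real model operates on X-Codec audio codes), and the frame-by-frame alternation is a simplification of the actual interleaving scheme:

```python
# Toy illustration of SongGen's two track layouts (hypothetical tokens,
# simplified interleaving; not the project's actual data format).

def mixed_layout(mixed_tokens):
    """'Mixed Pro' (single-track): the sequence is just mixed-audio tokens."""
    return list(mixed_tokens)

def interleaving_av_layout(acc_tokens, vocal_tokens):
    """'Interleaving A-V' (dual-track): accompaniment and vocal tokens
    alternate within one autoregressive sequence."""
    assert len(acc_tokens) == len(vocal_tokens)
    seq = []
    for a, v in zip(acc_tokens, vocal_tokens):
        seq.extend([a, v])
    return seq

acc = [101, 102, 103]   # hypothetical accompaniment codec tokens
voc = [201, 202, 203]   # hypothetical vocal codec tokens
print(interleaving_av_layout(acc, voc))  # [101, 201, 102, 202, 103, 203]
print(mixed_layout([301, 302, 303]))     # [301, 302, 303]
```

The dual-track layout lets a single Transformer model both tracks jointly while keeping them separable at decode time.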

Quick Start & Requirements

  • Install: Clone the repository, create and activate a conda environment (conda create -n songgen_env python=3.9.18), then install dependencies (pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118, followed by pip install flash-attn==2.6.1 --no-build-isolation). For inference only, install the package with pip install -e .
  • Prerequisites: CUDA >= 11.8, PyTorch, X-Codec checkpoint (download and place in SongGen/songgen/xcodec_wrapper/xcodec_infer/ckpts/general_more).
  • Resources: Requires a GPU for inference. Training code and a detailed guide are available.
  • Links: Paper and Demo Page, MusicCaps Test Set, SongGen Interleaving (A-V) checkpoint, SongGen Mixed_pro checkpoint.

Highlighted Details

  • Single-stage auto-regressive Transformer for text-to-song generation.
  • Supports "Mixed Pro" (single-track) and "Interleaving A-V" (dual-track) modes.
  • Versatile control via lyrics, descriptive text, and optional reference voice.
  • Released checkpoints for both "Mixed Pro" and "Interleaving A-V" modes.
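The "single-stage" design highlighted above means one autoregressive loop emits audio tokens directly, with no separate coarse-to-fine stage. A minimal toy sketch of that loop, where the next-token function is a deterministic stand-in for a Transformer forward pass (all names and the toy rule are hypothetical):

```python
# Toy single-stage autoregressive decoding loop (illustrative only;
# SongGen's real model predicts X-Codec audio tokens with a Transformer).

def toy_next_token(prefix, vocab_size=8):
    # Stand-in for a model forward pass: a deterministic toy rule.
    return (sum(prefix) + len(prefix)) % vocab_size

def generate(prompt_tokens, n_steps):
    """Extend the prompt one token at a time, conditioning on the full prefix."""
    seq = list(prompt_tokens)
    for _ in range(n_steps):
        seq.append(toy_next_token(seq))
    return seq

print(generate([3, 1], 4))  # [3, 1, 6, 5, 3, 7]
```

In the real system, the prompt would encode the text description, lyrics, and optional reference voice, and the generated tokens would be decoded back to audio by the X-Codec decoder.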

Maintenance & Community

The project is associated with ICML 2025. Training code and a detailed training guide have been released. Contact Zihan Liu (liuzihan@pjlab.org.cn) and Jiaqi Wang (wangjiaqi@pjlab.org.cn) for inquiries or collaborations.

Licensing & Compatibility

The repository does not explicitly state a license. The project builds upon Parler-TTS, X-Codec, and lp-music-caps, whose licenses should be considered for compatibility.

Limitations & Caveats

The model is currently restricted to generating English songs with a maximum duration of 30 seconds due to limitations in the training dataset. Scaling up data and model size is suggested for further improvements in lyrics alignment and musicality.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu (0.1%, 3k stars)

Audio generation research paper using latent diffusion
Created 2 years ago, updated 2 months ago
Starred by Christian Laforte (Distinguished Engineer at NVIDIA; Former CTO at Stability AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

Amphion by open-mmlab (0.2%, 9k stars)

Toolkit for audio, music, and speech generation research
Created 1 year ago, updated 3 months ago