Text-to-song generation with an auto-regressive transformer
SongGen is a single-stage auto-regressive Transformer model for text-to-song generation, offering control via lyrics, descriptive text, and optional reference voice. It targets researchers and developers in music generation, providing a baseline for creating coherent and expressive songs from textual prompts.
How It Works
SongGen uses a single-stage auto-regressive Transformer that generates audio tokens directly from the text description and the lyrics. It supports two output modes: "Mixed Pro", which produces a single track with vocals and accompaniment mixed together, and "Interleaving A-V", which produces vocals and accompaniment as two separate tracks. Audio is tokenized with X-Codec, and an optional reference voice clip can be supplied to condition the vocal style.
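To make the dual-track idea concrete, here is a toy sketch of interleaving two token streams into one sequence that a single auto-regressive model could predict left to right. It is an illustration only: the function names are hypothetical and not taken from the SongGen code base, and the accompaniment-first ordering is an assumption read off the "A-V" mode name. Real codec tokens also span several codebooks, which this sketch ignores.

```python
from typing import List, Sequence, Tuple


def interleave_tracks(first: Sequence[int], second: Sequence[int]) -> List[int]:
    """Merge two equal-length token tracks frame by frame: f0, s0, f1, s1, ..."""
    if len(first) != len(second):
        raise ValueError("both tracks must have the same number of frames")
    merged: List[int] = []
    for a, b in zip(first, second):
        merged.extend((a, b))
    return merged


def split_tracks(merged: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Undo interleave_tracks: even positions are track one, odd are track two."""
    if len(merged) % 2 != 0:
        raise ValueError("an interleaved sequence must have an even length")
    return list(merged[0::2]), list(merged[1::2])


# Hypothetical token ids; assumed ordering is accompaniment first, then vocal.
accompaniment = [10, 11, 12]
vocal = [20, 21, 22]

sequence = interleave_tracks(accompaniment, vocal)
print(sequence)                # [10, 20, 11, 21, 12, 22]
print(split_tracks(sequence))  # ([10, 11, 12], [20, 21, 22])
```

Keeping both tracks in one sequence lets a single decoder condition each vocal frame on the accompaniment frame at the same time step, while the two tracks can still be separated after generation.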
Quick Start & Requirements
Create a conda environment (conda create -n songgen_env python=3.9.18), activate it (conda activate songgen_env), and install the dependencies (pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118, then pip install flash-attn==2.6.1 --no-build-isolation). For inference only, run pip install -e . from the repository directory.
X-Codec checkpoint directory: SongGen/songgen/xcodec_wrapper/xcodec_infer/ckpts/general_more
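After installation, a quick sanity check can catch a mismatched environment before running inference. The script below is a minimal sketch using only the Python standard library. It is not part of SongGen: the helper names are hypothetical, and the checkpoint path is assumed to be relative to the directory the script is run from.

```python
import sys
from importlib import metadata
from pathlib import Path
from typing import Dict, List

# Versions pinned in the install commands above.
PINNED: Dict[str, str] = {
    "torch": "2.3.0",
    "torchvision": "0.18.0",
    "torchaudio": "2.3.0",
    "flash-attn": "2.6.1",
}

# Assumption: the script runs from the directory that contains SongGen/.
CKPT_DIR = Path("SongGen/songgen/xcodec_wrapper/xcodec_infer/ckpts/general_more")


def base_version(version: str) -> str:
    """Drop a local build suffix, so '2.3.0+cu118' compares equal to '2.3.0'."""
    return version.split("+", 1)[0]


def check_packages(pinned: Dict[str, str]) -> List[str]:
    """Return one message per package that is missing or has another version."""
    problems: List[str] = []
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {wanted})")
            continue
        if base_version(found) != wanted:
            problems.append(f"{name}: found {found} (expected {wanted})")
    return problems


def main() -> List[str]:
    problems = check_packages(PINNED)
    if sys.version_info[:2] != (3, 9):
        problems.append(f"python: found {sys.version.split()[0]} (expected 3.9.x)")
    if not CKPT_DIR.is_dir():
        problems.append(f"checkpoint directory not found: {CKPT_DIR}")
    for line in problems or ["environment looks consistent with the pinned setup"]:
        print(line)
    return problems


if __name__ == "__main__":
    main()
```

The check reads installed package metadata rather than importing torch, so it runs even when the GPU stack is broken or absent.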
Highlighted Details
Maintenance & Community
The project is associated with ICML 2025. Training code and a detailed training guide have been released. Contact Zihan Liu (liuzihan@pjlab.org.cn) and Jiaqi Wang (wangjiaqi@pjlab.org.cn) for inquiries or collaborations.
Licensing & Compatibility
The repository does not explicitly state a license. The project builds on Parler-TTS, X-Codec, and LP-MusicCaps, so the licenses of those projects should be checked for compatibility before reuse.
Limitations & Caveats
The model is currently restricted to generating English songs with a maximum duration of 30 seconds due to limitations in the training dataset. Scaling up data and model size is suggested for further improvements in lyrics alignment and musicality.