SongGen by LiuZH-19

Text-to-song generation with an auto-regressive transformer

Created 7 months ago
260 stars

Top 97.7% on SourcePulse

View on GitHub
Project Summary

SongGen is a single-stage auto-regressive Transformer model for text-to-song generation, offering control via lyrics, descriptive text, and optional reference voice. It targets researchers and developers in music generation, providing a baseline for creating coherent and expressive songs from textual prompts.

How It Works

SongGen employs a single-stage auto-regressive Transformer architecture, directly generating audio tokens from text and lyrics. It supports both "Mixed Pro" (single-track) and "Interleaving A-V" (dual-track) modes, allowing for versatile song structures. The model leverages an X-Codec for audio tokenization and offers optional reference voice conditioning for style transfer.
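As an illustration of the two output layouts described above, here is a toy sketch. The token values are hypothetical placeholders (the real model operates on X-Codec audio codes), and the frame-by-frame alternation is a simplification of the actual interleaving scheme:

```python
# Toy illustration of SongGen's two track layouts (hypothetical tokens,
# simplified interleaving; not the project's actual data format).

def mixed_layout(mixed_tokens):
    """'Mixed Pro' (single-track): the sequence is just mixed-audio tokens."""
    return list(mixed_tokens)

def interleaving_av_layout(acc_tokens, vocal_tokens):
    """'Interleaving A-V' (dual-track): accompaniment and vocal tokens
    alternate within one autoregressive sequence."""
    assert len(acc_tokens) == len(vocal_tokens)
    seq = []
    for a, v in zip(acc_tokens, vocal_tokens):
        seq.extend([a, v])
    return seq

acc = [101, 102, 103]   # hypothetical accompaniment codec tokens
voc = [201, 202, 203]   # hypothetical vocal codec tokens
print(interleaving_av_layout(acc, voc))  # [101, 201, 102, 202, 103, 203]
print(mixed_layout([301, 302, 303]))     # [301, 302, 303]
```

The dual-track layout lets a single Transformer model both tracks jointly while keeping them separable at decode time.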

Quick Start & Requirements

  • Install: Clone the repository, create and activate a conda environment (conda create -n songgen_env python=3.9.18), then install dependencies (pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118, followed by pip install flash-attn==2.6.1 --no-build-isolation). For inference only, install the package with pip install -e .
  • Prerequisites: CUDA >= 11.8, PyTorch, X-Codec checkpoint (download and place in SongGen/songgen/xcodec_wrapper/xcodec_infer/ckpts/general_more).
  • Resources: Requires a GPU for inference. Training code and a detailed guide are available.
  • Links: Paper and Demo Page, MusicCaps Test Set, SongGen Interleaving (A-V) checkpoint, SongGen Mixed_pro checkpoint.

Highlighted Details

  • Single-stage auto-regressive Transformer for text-to-song generation.
  • Supports "Mixed Pro" (single-track) and "Interleaving A-V" (dual-track) modes.
  • Versatile control via lyrics, descriptive text, and optional reference voice.
  • Released checkpoints for both "Mixed Pro" and "Interleaving A-V" modes.
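The "single-stage" design highlighted above means one autoregressive loop emits audio tokens directly, with no separate coarse-to-fine stage. A minimal toy sketch of that loop, where the next-token function is a deterministic stand-in for a Transformer forward pass (all names and the toy rule are hypothetical):

```python
# Toy single-stage autoregressive decoding loop (illustrative only;
# SongGen's real model predicts X-Codec audio tokens with a Transformer).

def toy_next_token(prefix, vocab_size=8):
    # Stand-in for a model forward pass: a deterministic toy rule.
    return (sum(prefix) + len(prefix)) % vocab_size

def generate(prompt_tokens, n_steps):
    """Extend the prompt one token at a time, conditioning on the full prefix."""
    seq = list(prompt_tokens)
    for _ in range(n_steps):
        seq.append(toy_next_token(seq))
    return seq

print(generate([3, 1], 4))  # [3, 1, 6, 5, 3, 7]
```

In the real system, the prompt would encode the text description, lyrics, and optional reference voice, and the generated tokens would be decoded back to audio by the X-Codec decoder.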

Maintenance & Community

The project is associated with ICML 2025. Training code and a detailed training guide have been released. Contact Zihan Liu (liuzihan@pjlab.org.cn) and Jiaqi Wang (wangjiaqi@pjlab.org.cn) for inquiries or collaborations.

Licensing & Compatibility

The repository does not explicitly state a license. The project builds upon Parler-TTS, X-Codec, and lp-music-caps, whose licenses should be considered for compatibility.

Limitations & Caveats

The model is currently restricted to generating English songs with a maximum duration of 30 seconds due to limitations in the training dataset. Scaling up data and model size is suggested for further improvements in lyrics alignment and musicality.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu (0.1%, 3k stars)

Audio generation research paper using latent diffusion
Created 2 years ago, updated 2 months ago
Starred by Christian Laforte (Distinguished Engineer at NVIDIA; Former CTO at Stability AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

Amphion by open-mmlab (0.2%, 9k stars)

Toolkit for audio, music, and speech generation research
Created 1 year ago, updated 3 months ago