Multi-modal model for music understanding and generation research
MuMu-LLaMA is a multi-modal model for music understanding and generation, aimed at researchers and developers working on AI music. It supports music question answering, music generation from text, images, videos, and audio, and music editing, built on a modular architecture for flexibility.
How It Works
MuMu-LLaMA integrates specialized encoders (MERT for music, ViT for images, ViViT for video) with a LLaMA 2 backbone. Music generation is handled by either MusicGen or AudioLDM2, connected via adapters. This multi-modal approach allows for rich music interaction and creation by combining diverse input modalities with a powerful language model.
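The modular layout described above can be pictured as a small PyTorch sketch. Everything below is an illustrative stand-in under assumed dimensions, not the project's actual code: dummy linear layers replace the MERT/ViT/ViViT encoders and the LLaMA 2 backbone, and the output head only mimics the conditioning signal a MusicGen or AudioLDM2 decoder would consume.

```python
# Structural sketch of the pipeline described above (hypothetical stand-ins,
# not MuMu-LLaMA's real modules or dimensions).
import torch
import torch.nn as nn


class ModalityAdapter(nn.Module):
    """Projects a modality encoder's features into the LLM embedding space."""

    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)


class MuMuLLaMASketch(nn.Module):
    def __init__(self, llm_dim: int = 512):  # LLaMA 2 7B uses 4096; kept small here
        super().__init__()
        # Dummy stand-ins for MERT (music), ViT (image), and ViViT (video) encoders.
        self.music_encoder = nn.Linear(1024, 1024)
        self.image_encoder = nn.Linear(768, 768)
        self.video_encoder = nn.Linear(768, 768)
        # One adapter per modality maps encoder features to the LLM hidden size.
        self.adapters = nn.ModuleDict({
            "music": ModalityAdapter(1024, llm_dim),
            "image": ModalityAdapter(768, llm_dim),
            "video": ModalityAdapter(768, llm_dim),
        })
        # Dummy stand-in for the LLaMA 2 backbone (a real decoder-only LM in practice).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Stand-in for the output adapter that would condition MusicGen or AudioLDM2.
        self.music_output_head = nn.Linear(llm_dim, 128)

    def forward(self, text_emb, music=None, image=None, video=None):
        tokens = [text_emb]  # text tokens already embedded in LLM space
        if music is not None:
            tokens.append(self.adapters["music"](self.music_encoder(music)))
        if image is not None:
            tokens.append(self.adapters["image"](self.image_encoder(image)))
        if video is not None:
            tokens.append(self.adapters["video"](self.video_encoder(video)))
        hidden = self.llm(torch.cat(tokens, dim=1))
        # Conditioning signal a MusicGen/AudioLDM2-style decoder would consume.
        return self.music_output_head(hidden)


if __name__ == "__main__":
    model = MuMuLLaMASketch()
    text = torch.randn(1, 16, 512)    # dummy text token embeddings
    music = torch.randn(1, 10, 1024)  # dummy MERT-like music features
    print(model(text, music=music).shape)  # torch.Size([1, 26, 128])
```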
Quick Start & Requirements
conda create --name <env> --file requirements.txt
python gradio_app.py --model <path> --llama_dir <path> [--music_decoder <name>] [--music_decoder_path <path>]
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Training requires significant GPU resources (e.g., 2x 32GB V100 GPUs for stage 3), inference needs a single 32GB V100 GPU, and loading the model checkpoints requires roughly 49GB of CPU memory. The project is presented as a research artifact accompanying an arXiv preprint, so ongoing changes and some instability should be expected.