Multi-modal model for music understanding and generation research
MuMu-LLaMA is a multi-modal model for music understanding and generation, aimed at researchers and developers working on AI music. It supports music question answering, music generation from text, images, videos, and audio, and music editing, built on a modular architecture for flexibility.
How It Works
MuMu-LLaMA integrates specialized encoders (MERT for music, ViT for images, ViViT for video) with a LLaMA 2 backbone. Music generation is handled by either MusicGen or AudioLDM2, connected via adapters. This multi-modal approach allows for rich music interaction and creation by combining diverse input modalities with a powerful language model.
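The modular layout described above can be pictured as a small PyTorch sketch. Everything below is an illustrative stand-in under assumed dimensions, not the project's actual code: dummy linear layers replace the MERT/ViT/ViViT encoders and the LLaMA 2 backbone, and the output head only mimics the conditioning signal a MusicGen or AudioLDM2 decoder would consume.

```python
# Structural sketch of the pipeline described above (hypothetical stand-ins,
# not MuMu-LLaMA's real modules or dimensions).
import torch
import torch.nn as nn


class ModalityAdapter(nn.Module):
    """Projects a modality encoder's features into the LLM embedding space."""

    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)


class MuMuLLaMASketch(nn.Module):
    def __init__(self, llm_dim: int = 512):  # LLaMA 2 7B uses 4096; kept small here
        super().__init__()
        # Dummy stand-ins for MERT (music), ViT (image), and ViViT (video) encoders.
        self.music_encoder = nn.Linear(1024, 1024)
        self.image_encoder = nn.Linear(768, 768)
        self.video_encoder = nn.Linear(768, 768)
        # One adapter per modality maps encoder features to the LLM hidden size.
        self.adapters = nn.ModuleDict({
            "music": ModalityAdapter(1024, llm_dim),
            "image": ModalityAdapter(768, llm_dim),
            "video": ModalityAdapter(768, llm_dim),
        })
        # Dummy stand-in for the LLaMA 2 backbone (a real decoder-only LM in practice).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Stand-in for the output adapter that would condition MusicGen or AudioLDM2.
        self.music_output_head = nn.Linear(llm_dim, 128)

    def forward(self, text_emb, music=None, image=None, video=None):
        tokens = [text_emb]  # text tokens already embedded in LLM space
        if music is not None:
            tokens.append(self.adapters["music"](self.music_encoder(music)))
        if image is not None:
            tokens.append(self.adapters["image"](self.image_encoder(image)))
        if video is not None:
            tokens.append(self.adapters["video"](self.video_encoder(video)))
        hidden = self.llm(torch.cat(tokens, dim=1))
        # Conditioning signal a MusicGen/AudioLDM2-style decoder would consume.
        return self.music_output_head(hidden)


if __name__ == "__main__":
    model = MuMuLLaMASketch()
    text = torch.randn(1, 16, 512)    # dummy text token embeddings
    music = torch.randn(1, 10, 1024)  # dummy MERT-like music features
    print(model(text, music=music).shape)  # torch.Size([1, 26, 128])
```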
Quick Start & Requirements
conda create --name <env> --file requirements.txt
python gradio_app.py --model <path> --llama_dir <path> [--music_decoder <name>] [--music_decoder_path <path>]
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Training requires significant GPU resources (e.g., 2x 32GB V100 GPUs for stage 3), inference needs a single 32GB V100 GPU, and loading the model checkpoints requires roughly 49GB of CPU memory. The project is presented as a research artifact accompanying an arXiv preprint, so ongoing changes and some instability should be expected.