MuMu-LLaMA by shansongliu

Multi-modal model for music understanding and generation research

Created 1 year ago · 498 stars · Top 63.2% on sourcepulse

Project Summary

MuMu-LLaMA is a multi-modal model for music understanding and generation, aimed at researchers and developers working on AI music. It supports music question answering, music generation from text, images, videos, and audio, and music editing, built on a modular architecture that lets individual components be swapped.

How It Works

MuMu-LLaMA integrates specialized encoders (MERT for music, ViT for images, ViViT for video) with a LLaMA 2 backbone. Music generation is handled by either MusicGen or AudioLDM2, connected via adapters. Combining these encoders and decoders with a language-model backbone lets the system answer questions about music and generate or edit music conditioned on text, image, video, and audio inputs.
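
The sketch below is a conceptual illustration of this adapter-based design, not the project's actual code; the class names, dimensions, and encoder/decoder interfaces are assumptions made for clarity.

```python
# Conceptual sketch of an adapter-based multi-modal pipeline (illustrative only).
# In the real system the stand-ins would be MERT, ViT, ViViT, LLaMA 2, and
# MusicGen or AudioLDM2; names and interfaces here are assumptions.
import torch
import torch.nn as nn


class ModalityAdapter(nn.Module):
    """Projects a frozen encoder's features into the LLM embedding space."""

    def __init__(self, enc_dim: int, llama_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llama_dim),
            nn.GELU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)


class MuMuLLaMASketch(nn.Module):
    """Illustrative wiring: frozen encoders -> adapters -> LLM -> music decoder."""

    def __init__(self, encoders: dict, encoder_dims: dict, llm: nn.Module,
                 music_decoder: nn.Module, llama_dim: int = 4096):
        super().__init__()
        # e.g. {"music": MERT, "image": ViT, "video": ViViT} in the real system
        self.encoders = nn.ModuleDict(encoders)
        self.adapters = nn.ModuleDict({
            name: ModalityAdapter(encoder_dims[name], llama_dim)
            for name in encoders
        })
        self.llm = llm                      # LLaMA 2 backbone in the real system
        self.music_decoder = music_decoder  # MusicGen or AudioLDM2 in the real system

    def forward(self, inputs: dict, text_tokens: torch.Tensor) -> torch.Tensor:
        # Encode each provided modality and project it into the LLM token space.
        prompt_embeds = [
            self.adapters[name](self.encoders[name](x)) for name, x in inputs.items()
        ]
        # The LLM consumes the multi-modal prompt embeddings plus the text query;
        # its hidden states then condition the chosen music decoder.
        llm_hidden = self.llm(prompt_embeds, text_tokens)
        return self.music_decoder(llm_hidden)
```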

Quick Start & Requirements

  • Install via conda create --name <env> --file requirements.txt.
  • Requires Python 3.9.17 and an NVIDIA driver supporting CUDA 12 (for PyTorch 2.1.0).
  • LLaMA-2 model weights are required (obtainable via HuggingFace).
  • Pre-trained checkpoints for MuMu-LLaMA (with MusicGen or AudioLDM2) and necessary multi-modal encoders are available.
  • The Gradio demo can be run with python gradio_app.py --model <path> --llama_dir <path> [--music_decoder <name>] [--music_decoder_path <path>] (see the launch example after this list).
  • Official documentation and demo links are not explicitly provided, but checkpoint locations are detailed.
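
As a concrete example, the snippet below launches the Gradio demo via Python's subprocess module. Every path is a placeholder for checkpoints you have downloaded yourself, and the --music_decoder value shown is an assumption based on the decoder names above, not a confirmed flag value.

```python
import subprocess

# All paths below are placeholders; point them at your own downloaded weights.
cmd = [
    "python", "gradio_app.py",
    "--model", "./ckpts/mumu_llama.pth",         # MuMu-LLaMA checkpoint (placeholder)
    "--llama_dir", "./ckpts/LLaMA-2",            # LLaMA-2 weights directory (placeholder)
    "--music_decoder", "musicgen",               # assumed value; use the AudioLDM2 variant if applicable
    "--music_decoder_path", "./ckpts/MusicGen",  # music decoder weights (placeholder)
]
subprocess.run(cmd, check=True)
```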

Highlighted Details

  • Supports music question answering and generation from text, image, video, and audio inputs.
  • Capable of music editing tasks.
  • Modular design allows integration with different music decoders (MusicGen, AudioLDM2).
  • Utilizes MERT, ViT, and ViViT for multi-modal understanding.

Maintenance & Community

  • The project is associated with the paper "MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models" (arXiv:2412.06660).
  • Code elements are derived from crypto-code/MU-LLaMA.
  • No explicit community channels (Discord/Slack) or roadmap are mentioned.

Licensing & Compatibility

  • The README does not specify a license.
  • Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

Training requires significant GPU resources (e.g., 2x 32GB V100 GPUs for stage 3), and inference needs a single 32GB V100 GPU. Loading model checkpoints requires approximately 49GB of CPU memory. The project is presented as a research artifact accompanying an arXiv preprint, so interfaces and checkpoints may still change.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days
