Multimodal LLM research code for music-related tasks
This repository provides the code for the LLark multimodal instruction-following language model for music, as detailed in the ICML 2024 paper. It enables researchers and developers to preprocess audio datasets, generate instruction-tuning data, extract embeddings, and evaluate music models, targeting advancements in AI-driven music understanding and generation.
How It Works
The project leverages Apache Beam for scalable data preprocessing, allowing for local execution or parallel processing on Google Cloud Dataflow. It integrates with OpenAI for instruction data generation and utilizes pre-trained Jukebox embeddings and CLAP models for audio feature extraction. The architecture is designed to facilitate the training and evaluation of multimodal music language models.
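As a rough sketch of the local-versus-Dataflow split described above, a minimal Beam preprocessing pipeline might look like the following; the file names, fields, and transforms are illustrative, not the repository's actual entry points:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(input_path: str, output_path: str, runner: str = "DirectRunner") -> None:
    """Tiny preprocessing pipeline: read JSONL metadata, filter it, and re-shard it."""
    # "DirectRunner" executes locally; switching to "DataflowRunner" (plus project,
    # region, and temp_location options) submits the same graph to Cloud Dataflow.
    options = PipelineOptions(runner=runner)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadMetadata" >> beam.io.ReadFromText(input_path)
            | "ParseJson" >> beam.Map(json.loads)
            | "KeepAnnotated" >> beam.Filter(lambda ex: ex.get("annotations"))  # hypothetical field
            | "SerializeJson" >> beam.Map(json.dumps)
            | "WriteShards" >> beam.io.WriteToText(output_path)
        )


if __name__ == "__main__":
    run("metadata.jsonl", "out/processed", runner="DirectRunner")
```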
Quick Start & Requirements
Docker images are provided for training, preprocessing, and Jukebox embedding extraction (m2t-train.dockerfile, m2t-preprocess.dockerfile, jukebox-embed.dockerfile).
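For the embedding-extraction step mentioned above, a minimal sketch of CLAP feature extraction, assuming the standalone laion_clap package (the repository's own scripts, and the Jukebox pipeline built from jukebox-embed.dockerfile, may differ):

```python
import laion_clap

# Load a pretrained CLAP audio encoder; load_ckpt() fetches the default checkpoint.
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# Embed a batch of audio files; returns a (num_files, 512) numpy array.
audio_files = ["clip_01.wav", "clip_02.wav"]  # hypothetical paths
embeddings = model.get_audio_embedding_from_filelist(x=audio_files, use_tensor=False)
print(embeddings.shape)
```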
Highlighted Details
Maintenance & Community
The repository was last updated about a year ago and is currently inactive.
Licensing & Compatibility
Limitations & Caveats
This repository does not include trained models. The data preprocessing pipeline requires significant setup for Google Cloud Dataflow integration, including Docker image management, and running it on Dataflow incurs cloud costs. Official training support is not provided; the training scripts are offered primarily as a hyperparameter reference.