Multi-modal LLM for image, video, audio, and text integration
Macaw-LLM is an exploratory project that integrates image, video, audio, and text data for multi-modal language modeling. It targets researchers and developers working on advanced AI systems that require understanding and processing diverse data types, offering a unified approach to multi-modal instruction following.
How It Works
Macaw-LLM combines CLIP for visual encoding, Whisper for audio encoding, and a base LLM (LLaMA, Vicuna, or Bloom) for text processing. Its alignment strategy injects multi-modal features into the LLM by treating them as attention queries, with the LLM's token-embedding matrix providing the keys and values. Because the aligned features already live in the LLM's embedding space, adaptation is fast, requires few additional parameters, and enables a one-stage instruction fine-tuning process.
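To make the alignment idea concrete, here is a minimal PyTorch sketch of that attention-based alignment, not the project's actual implementation: encoder features act as queries, and the rows of the LLM's token-embedding matrix act as keys and values, so the output can be prepended to text embeddings as soft tokens. The class name `ModalityAligner`, the toy dimensions, and the single-head choice are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ModalityAligner(nn.Module):
    """Aligns encoder features to the LLM embedding space via cross-attention."""

    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        # Project encoder features (e.g., CLIP/Whisper outputs) into query space.
        self.to_query = nn.Linear(feat_dim, embed_dim)
        # A single attention head keeps the number of added parameters small.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=1, batch_first=True)

    def forward(self, feats: torch.Tensor, token_embedding: nn.Embedding) -> torch.Tensor:
        # feats: (batch, num_features, feat_dim) from an image/video/audio encoder.
        q = self.to_query(feats)                                              # (B, N, D)
        # The LLM's token-embedding rows serve as keys and values, shared per batch item.
        kv = token_embedding.weight.unsqueeze(0).expand(q.size(0), -1, -1)    # (B, V, D)
        aligned, _ = self.attn(q, kv, kv)                                     # (B, N, D)
        # The result lies in the embedding space and can be prepended to text tokens.
        return aligned


# Toy usage: small vocabulary/embedding sizes stand in for a real LLM.
embed = nn.Embedding(1000, 512)
aligner = ModalityAligner(feat_dim=768, embed_dim=512)
visual_feats = torch.randn(2, 16, 768)   # e.g., 16 visual patch features per sample
soft_tokens = aligner(visual_feats, embed)
print(soft_tokens.shape)                 # torch.Size([2, 16, 512])
```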
Quick Start & Requirements
Install the Python dependencies with `pip install -r requirements.txt`, install ffmpeg (e.g., `yum install ffmpeg -y`), and install apex from source.
Requirements: ffmpeg, NVIDIA Apex.