lyuchenyang/Macaw-LLM: Multi-modal LLM for image, video, audio, and text integration
Top 26.0% on SourcePulse
Macaw-LLM is an exploratory project that integrates image, video, audio, and text data for multi-modal language modeling. It targets researchers and developers working on advanced AI systems that require understanding and processing diverse data types, offering a unified approach to multi-modal instruction following.
How It Works
Macaw-LLM combines CLIP for visual encoding, Whisper for audio encoding, and a base LLM (LLaMA, Vicuna, or Bloom) for text processing. Its alignment strategy injects multi-modal features into the LLM via attention: the encoded features act as queries, while the LLM's token-embedding matrix supplies the keys and values. Because this alignment step adds few parameters, the model adapts quickly and can be trained with a one-stage instruction fine-tuning process.
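Below is a minimal PyTorch sketch of that alignment idea. The module name ModalityAligner, the dimensions, and the single linear projection are illustrative assumptions rather than the repository's actual code; it only shows how encoder features can attend over the LLM's embedding matrix to produce "soft token" embeddings in the LLM's input space.

```python
# Illustrative sketch only -- names, shapes, and the single linear projection
# are assumptions, not Macaw-LLM's actual implementation.
import torch
import torch.nn as nn


class ModalityAligner(nn.Module):
    """Maps CLIP/Whisper features into an LLM's embedding space via attention."""

    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        # The only trainable parameters added by the alignment step.
        self.query_proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, features: torch.Tensor, llm_embed: torch.Tensor) -> torch.Tensor:
        # features:  (batch, tokens, feat_dim)  -- encoder outputs, used as queries
        # llm_embed: (vocab, embed_dim)         -- LLM embedding matrix, keys = values
        q = self.query_proj(features)                           # (B, T, E)
        scores = q @ llm_embed.T / llm_embed.shape[-1] ** 0.5   # (B, T, V)
        attn = scores.softmax(dim=-1)
        return attn @ llm_embed                                 # (B, T, E) soft-token embeddings


# Dummy usage with made-up dimensions (e.g., 768-d CLIP features, a 4096-d LLM):
aligner = ModalityAligner(feat_dim=768, embed_dim=4096)
visual = torch.randn(2, 16, 768)           # stand-in for CLIP patch features
embed_matrix = torch.randn(32000, 4096)    # stand-in for the LLM's embedding table
soft_tokens = aligner(visual, embed_matrix)  # -> (2, 16, 4096)
```

The resulting soft tokens live in the same space as ordinary text-token embeddings, so they can be concatenated with the text sequence before the LLM, which is what makes a single instruction fine-tuning stage possible.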
Quick Start & Requirements
Install Python dependencies with pip install -r requirements.txt, install ffmpeg (e.g., yum install ffmpeg -y), and build NVIDIA Apex from source.
Requirements: ffmpeg, NVIDIA Apex.
Last updated 1 year ago; the project is inactive.
Related projects: InternLM, X-PLUG, NExT-GPT