Macaw-LLM by lyuchenyang

Multi-modal LLM for image, video, audio, and text integration

created 2 years ago
1,578 stars

Top 27.0% on sourcepulse

Project Summary

Macaw-LLM is an exploratory project that integrates image, video, audio, and text data for multi-modal language modeling. It targets researchers and developers working on advanced AI systems that require understanding and processing diverse data types, offering a unified approach to multi-modal instruction following.

How It Works

Macaw-LLM combines CLIP for visual encoding, Whisper for audio encoding, and a base LLM (LLaMA, Vicuna, or Bloom) for text processing. Its alignment strategy injects multi-modal features into the LLM through an attention step in which the encoded features act as queries while the LLM's embedding matrix supplies the keys and values, mapping each feature onto the model's existing token-embedding space. This enables faster adaptation with minimal additional parameters and a one-stage instruction fine-tuning process; a sketch of the mechanism follows.
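The snippet below is a minimal, hedged sketch of the alignment idea described above, not the project's actual implementation: encoder outputs serve as attention queries and the LLM's (frozen) embedding matrix provides keys and values, so each modality feature becomes a soft mixture of existing token embeddings. The module name MultiModalAligner and the dimensions d_feat and d_model are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiModalAligner(nn.Module):
    """Maps multi-modal encoder features into the LLM's token-embedding space."""

    def __init__(self, d_feat: int, d_model: int):
        super().__init__()
        # Only these small projections would be trained; the LLM embeddings stay frozen.
        self.q_proj = nn.Linear(d_feat, d_model)
        self.k_proj = nn.Linear(d_model, d_model)

    def forward(self, modal_feats: torch.Tensor, llm_embed: torch.Tensor) -> torch.Tensor:
        # modal_feats: (batch, n_feats, d_feat) from visual/audio encoders such as CLIP or Whisper
        # llm_embed:   (vocab, d_model) frozen LLM token-embedding matrix
        q = self.q_proj(modal_feats)                                  # (batch, n_feats, d_model)
        k = self.k_proj(llm_embed)                                    # (vocab, d_model)
        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)    # (batch, n_feats, vocab)
        # Soft mixture over embedding rows: the output already lives in the LLM's
        # input space and can be prepended to the text token embeddings.
        return attn @ llm_embed                                       # (batch, n_feats, d_model)

# Example with made-up sizes: 256 visual features of width 1024 aligned
# into a 4096-dim LLaMA-style embedding space over a 32k vocabulary.
aligner = MultiModalAligner(d_feat=1024, d_model=4096)
feats = torch.randn(2, 256, 1024)
embed_matrix = torch.randn(32000, 4096)
aligned = aligner(feats, embed_matrix)  # (2, 256, 4096)
```

Because the output is expressed directly in the LLM's embedding space, no separate projection head or second alignment stage is needed, which is consistent with the one-stage fine-tuning described above.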

Quick Start & Requirements

Health Check

Last commit: 7 months ago
Responsiveness: 1 week
Pull Requests (30d): 0
Issues (30d): 0
Star History
22 stars in the last 90 days

