Macaw-LLM by lyuchenyang

Multi-modal LLM for image, video, audio, and text integration

Created 2 years ago
1,582 stars

Top 26.5% on SourcePulse

Project Summary

Macaw-LLM is an exploratory project that integrates image, video, audio, and text data for multi-modal language modeling. It targets researchers and developers working on advanced AI systems that require understanding and processing diverse data types, offering a unified approach to multi-modal instruction following.

How It Works

Macaw-LLM combines CLIP for visual encoding, Whisper for audio encoding, and a base LLM (LLaMA, Vicuna, or Bloom) for text processing. Its alignment strategy injects multi-modal features into the LLM by treating the encoded features as attention queries and the LLM's embedding matrix as the keys and values, so the aligned representations land directly in the LLM's token-embedding space. Because this adds only a small number of extra parameters, adaptation is fast and instruction fine-tuning can be done in a single stage.
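The sketch below illustrates that alignment idea in Python. It is not the repository's code; the class, parameter names, and shapes are illustrative assumptions. The point is only the mechanism: modality features are projected to the LLM's width, attend over the embedding matrix, and the attention output serves as the aligned multi-modal representation.

```python
import torch
import torch.nn as nn


class MultiModalAligner(nn.Module):
    """Minimal sketch (not Macaw-LLM's actual implementation) of the alignment
    idea: encoded image/audio features act as attention queries over the LLM's
    token-embedding matrix, so the output already lives in the LLM's embedding
    space."""

    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        # Project encoder outputs (e.g. CLIP or Whisper features) to the LLM width.
        self.query_proj = nn.Linear(feat_dim, embed_dim)
        self.scale = embed_dim ** -0.5

    def forward(self, modality_feats: torch.Tensor, llm_embedding: torch.Tensor) -> torch.Tensor:
        # modality_feats: (batch, num_modal_tokens, feat_dim)
        # llm_embedding:  (vocab_size, embed_dim) -- the LLM's token embedding matrix
        q = self.query_proj(modality_feats)                    # queries from the modality encoder
        k = v = llm_embedding                                  # keys and values from the LLM
        attn = torch.softmax(q @ k.t() * self.scale, dim=-1)   # soft selection over vocab embeddings
        return attn @ v                                        # (batch, num_modal_tokens, embed_dim)


# Hypothetical usage: align 256 visual tokens against a 32k-entry, 4096-d embedding table.
aligner = MultiModalAligner(feat_dim=1024, embed_dim=4096)
visual_feats = torch.randn(2, 256, 1024)       # e.g. CLIP features for a batch of 2
embedding_table = torch.randn(32000, 4096)     # e.g. a LLaMA-sized embedding matrix
aligned = aligner(visual_feats, embedding_table)  # shape: (2, 256, 4096)
```

In the full pipeline, the aligned features would then be combined with the embedded instruction text before being passed to the LLM for the single-stage instruction fine-tuning described above.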

Quick Start & Requirements

Health Check

Last Commit: 8 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

X-LLM by phellonchen

314 stars · Multimodal LLM research paper
Created 2 years ago · Updated 2 years ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

NExT-GPT by NExT-GPT

4k stars · Any-to-any multimodal LLM research paper
Created 2 years ago · Updated 4 months ago