Multi-modal LLM for image, video, audio, and text integration
Macaw-LLM is an exploratory project that integrates image, video, audio, and text data for multi-modal language modeling. It targets researchers and developers working on advanced AI systems that require understanding and processing diverse data types, offering a unified approach to multi-modal instruction following.
How It Works
Macaw-LLM combines CLIP for visual encoding, Whisper for audio encoding, and a base LLM (LLaMA, Vicuna, or Bloom) for text processing. Its alignment strategy injects multi-modal features into the LLM by treating them as attention queries, with the LLM's token-embedding matrix providing the keys and values. Because the aligned features already live in the LLM's embedding space, adaptation is fast, requires few additional parameters, and enables a one-stage instruction fine-tuning process.
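To make the alignment idea concrete, here is a minimal PyTorch sketch of that attention-based alignment, not the project's actual implementation: encoder features act as queries, and the rows of the LLM's token-embedding matrix act as keys and values, so the output can be prepended to text embeddings as soft tokens. The class name `ModalityAligner`, the toy dimensions, and the single-head choice are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ModalityAligner(nn.Module):
    """Aligns encoder features to the LLM embedding space via cross-attention."""

    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        # Project encoder features (e.g., CLIP/Whisper outputs) into query space.
        self.to_query = nn.Linear(feat_dim, embed_dim)
        # A single attention head keeps the number of added parameters small.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=1, batch_first=True)

    def forward(self, feats: torch.Tensor, token_embedding: nn.Embedding) -> torch.Tensor:
        # feats: (batch, num_features, feat_dim) from an image/video/audio encoder.
        q = self.to_query(feats)                                              # (B, N, D)
        # The LLM's token-embedding rows serve as keys and values, shared per batch item.
        kv = token_embedding.weight.unsqueeze(0).expand(q.size(0), -1, -1)    # (B, V, D)
        aligned, _ = self.attn(q, kv, kv)                                     # (B, N, D)
        # The result lies in the embedding space and can be prepended to text tokens.
        return aligned


# Toy usage: small vocabulary/embedding sizes stand in for a real LLM.
embed = nn.Embedding(1000, 512)
aligner = ModalityAligner(feat_dim=768, embed_dim=512)
visual_feats = torch.randn(2, 16, 768)   # e.g., 16 visual patch features per sample
soft_tokens = aligner(visual_feats, embed)
print(soft_tokens.shape)                 # torch.Size([2, 16, 512])
```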
Quick Start & Requirements
Install the Python dependencies with `pip install -r requirements.txt`, install ffmpeg (e.g., `yum install ffmpeg -y`), and install apex from source.
Requirements: ffmpeg, NVIDIA Apex.