Multimodal LLM research paper
X-LLM is a framework for building multimodal large language models by treating different data modalities as foreign languages. It targets researchers and developers who want to add image, audio, or device-status understanding to LLMs, and is built on the ChatGLM architecture. The primary benefit is enabling LLMs to process and reason about diverse data types beyond text.
How It Works
X-LLM employs a three-stage training process. First, it converts multimodal inputs into a "foreign language" representation using X2L interfaces, with only these interfaces being updated. Second, these representations are aligned with the LLM (ChatGLM), again updating only the X2L interfaces. Finally, multiple modalities are integrated, with updates restricted to adapters within the X2L interfaces. This staged approach, inspired by BLIP-2, allows for efficient bootstrapping of multimodal capabilities.
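The following is a minimal sketch of the stage-wise parameter freezing described above, written in PyTorch. The module names (image_x2l, speech_x2l, and submodules whose names contain "adapter") are hypothetical placeholders, not the repository's actual API; the sketch only illustrates the pattern of keeping the LLM frozen while selectively training X2L components.

```python
import torch.nn as nn


class XLLMSketch(nn.Module):
    """Illustrative skeleton: a frozen LLM plus per-modality X2L interfaces.

    Module names here are hypothetical and chosen for clarity; they do not
    correspond to the X-LLM codebase.
    """

    def __init__(self, llm: nn.Module, image_x2l: nn.Module, speech_x2l: nn.Module):
        super().__init__()
        self.llm = llm                # ChatGLM backbone, frozen in every stage
        self.image_x2l = image_x2l    # maps image features to "foreign language" tokens
        self.speech_x2l = speech_x2l  # maps speech features to "foreign language" tokens


def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable


def configure_stage(model: XLLMSketch, stage: int) -> None:
    # The LLM stays frozen throughout; only X2L components are ever updated.
    set_trainable(model.llm, False)
    if stage in (1, 2):
        # Stage 1: train the X2L interfaces to "translate" each modality.
        # Stage 2: align those representations with the LLM; still only X2L updates.
        set_trainable(model.image_x2l, True)
        set_trainable(model.speech_x2l, True)
    elif stage == 3:
        # Stage 3: joint multimodal integration; restrict updates to adapter
        # submodules inside the X2L interfaces (here: any submodule named "adapter").
        set_trainable(model.image_x2l, False)
        set_trainable(model.speech_x2l, False)
        for name, submodule in model.named_modules():
            if "adapter" in name:
                set_trainable(submodule, True)
```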
Quick Start & Requirements
conda create -n lavis python=3.8
conda activate lavis
git clone https://github.com/phellonchen/X-LLM.git
cd X-LLM
pip install -e .
Data preparation is documented in README_DATA.md. Training and evaluation details are in README_TRAIN_EVAL.md.
Highlighted Details
Maintenance & Community
The last update was about 2 years ago, and the repository appears inactive.
Licensing & Compatibility
Limitations & Caveats