X-LLM by phellonchen

Multimodal LLM research paper

created 2 years ago
312 stars

Top 87.5% on sourcepulse

Project Summary

X-LLM is a framework for building multimodal large language models by treating each data modality as a foreign language. It targets researchers and developers who want to add capabilities such as image, audio, or device-status understanding to an LLM, and it is built on ChatGLM. The primary benefit is that the LLM can process and reason about data types beyond text.

How It Works

X-LLM trains in three stages. First, X2L interfaces convert each modality's input into a "foreign language" representation; only these interfaces are updated. Second, those representations are aligned with the LLM (ChatGLM), again updating only the X2L interfaces. Third, the modalities are integrated, with updates restricted to the adapters inside the X2L interfaces. This staged approach, inspired by BLIP-2, bootstraps multimodal capabilities efficiently because only the interfaces are updated at each stage.
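
The staged update schedule can be made concrete with a small sketch. It is not taken from the X-LLM code: the parameter-group names (`encoder`, `x2l.core`, `x2l.adapter`, `llm`) and the helper `trainable_groups` are hypothetical, and treating "the X2L interface" in stages 1 and 2 as including its adapter is an assumption. The sketch only records which groups would receive updates in each stage as described above.

```python
from typing import Dict

# Hypothetical parameter-group names; the source does not name them.
PARAM_GROUPS = [
    "encoder",      # single-modality encoder (image, audio, ...)
    "x2l.core",     # main body of an X2L interface
    "x2l.adapter",  # adapter inside an X2L interface
    "llm",          # the language model (ChatGLM)
]

# Name prefixes that are updated in each stage, following the description above.
# Assumption: stages 1 and 2 update the whole interface, adapter included.
TRAINABLE_PREFIXES = {
    1: ("x2l.",),         # convert modality inputs
    2: ("x2l.",),         # align representations with the LLM
    3: ("x2l.adapter",),  # integrate modalities: adapters only
}


def trainable_groups(stage: int) -> Dict[str, bool]:
    """Map each parameter group to True if it is updated in `stage`."""
    if stage not in TRAINABLE_PREFIXES:
        raise ValueError(f"unknown stage: {stage}")
    prefixes = TRAINABLE_PREFIXES[stage]
    return {name: name.startswith(prefixes) for name in PARAM_GROUPS}


for stage in (1, 2, 3):
    updated = [name for name, on in trainable_groups(stage).items() if on]
    print(f"stage {stage}: update {updated}")
```

In a real training script, a map like this would typically drive which parameters have gradients enabled; the encoder and LLM groups stay off in every stage, which is what keeps the procedure cheap.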

Quick Start & Requirements

  • Install:
      conda create -n lavis python=3.8
      conda activate lavis
      git clone https://github.com/phellonchen/X-LLM.git
      cd X-LLM
      pip install -e .
  • Prerequisites: Python 3.8, Conda. Specific dataset details are in README_DATA.md. Training and evaluation details are in README_TRAIN_EVAL.md.

Highlighted Details

  • Achieves an 84.5% relative score compared with GPT-4 on a custom evaluation dataset of 90 language-image instructions.
  • Supports integrating various modalities, including non-speech audio and terminal device status.
  • Leverages ChatGLM for its Chinese language capabilities and follows the BLIP-2 model architecture.

Maintenance & Community

  • The project is based on ChatGLM and BLIP-2. Code release is pending.

Licensing & Compatibility

  • The README does not explicitly state a license. The project is presented as an academic research artifact.

Limitations & Caveats

  • The full codebase and specific training/evaluation details are not yet released. The project relies on the ChatGLM base model.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
