MoAI by ByungKwanLee

PyTorch code for vision-language model (VLM) research

Created 1 year ago
326 stars
Top 84.8% on sourcepulse

View on GitHub
Project Summary

MoAI is a PyTorch implementation of a novel large language and vision model (LLVM) that improves zero-shot performance on vision-language tasks by integrating auxiliary information from specialized computer vision models. It targets researchers and practitioners seeking to improve LLVM performance on real-world scene understanding without increasing model size or requiring extensive dataset curation.

How It Works

MoAI leverages external computer vision models (segmentation, detection, scene graph generation, OCR) to extract auxiliary visual information. This information is then processed by two new modules: MoAI-Compressor, which aligns and condenses the external model outputs into a fixed number of tokens, and MoAI-Mixer, which blends these compressed features with standard visual and language features using a Mixture of Experts approach. This strategy aims to efficiently incorporate detailed scene understanding while mitigating potential performance degradation from imperfect external models.
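The repository's modules are more involved, but the following minimal PyTorch sketch illustrates the two ideas under stated assumptions: a set of learned queries cross-attends over the variable-length auxiliary features to produce a fixed number of tokens (the compressor role), and a soft token-wise gate blends expert projections of the language, visual, and auxiliary token streams (the mixer role). Class names, dimensions, and the expert count here are illustrative assumptions, not the repository's actual implementation.

    import torch
    import torch.nn as nn

    class CompressorSketch(nn.Module):
        """Condense variable-length auxiliary CV features (segmentation, detection,
        scene graph, OCR) into a fixed number of tokens via learned queries."""
        def __init__(self, dim=1024, num_tokens=64, num_heads=8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_tokens, dim))
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, aux_feats):              # aux_feats: (B, N_aux, dim), N_aux varies
            q = self.queries.unsqueeze(0).expand(aux_feats.size(0), -1, -1)
            compressed, _ = self.cross_attn(q, aux_feats, aux_feats)
            return compressed                      # (B, num_tokens, dim)

    class MixerSketch(nn.Module):
        """Blend language, visual, and compressed auxiliary tokens with a
        soft mixture-of-experts gate applied per token."""
        def __init__(self, dim=1024, num_experts=3):
            super().__init__()
            self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
            self.gate = nn.Linear(dim, num_experts)

        def forward(self, lang_tokens, vis_tokens, aux_tokens):
            x = torch.cat([lang_tokens, vis_tokens, aux_tokens], dim=1)   # (B, T, dim)
            weights = self.gate(x).softmax(dim=-1)                        # (B, T, E)
            expert_out = torch.stack([e(x) for e in self.experts], -1)    # (B, T, dim, E)
            return (expert_out * weights.unsqueeze(2)).sum(-1)            # (B, T, dim)

A real implementation would also need modality-specific projections and integration with the language model's layers; the sketch only shows the fixed-token compression and the gated blend.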

Quick Start & Requirements

  • Installation: Requires conda for environment setup and PyTorch 2.0.1 with CUDA 11.8. Install dependencies via pip install -r assets/requirements/requirements.txt and pip install -r assets/requirements/requirements_custom.txt, including flash-attn.
  • Environment Variables: Set DETECTRON2_DATASETS, DATASET, DATASET2, and VLDATASET to point to your dataset locations.
  • Checkpoints: Download and place a specific Scene Graph Generation checkpoint (psgtr_r50_epoch_60.pth) in moai/sgg/checkpoints.
  • Code Modifications: Requires commenting out specific lines in mmdet/apis/inference.py and modifying others, as well as commenting out a line in mmcv/transforms/processing.py.
  • Demo: A six-step Python script demonstrates loading images, prompts, and models (MoAI-7B, segmentation, object detection, OCR), then processing and generation; a rough outline appears after this list.
  • Evaluation: Bash scripts are provided for evaluating zero-shot performance on various benchmarks using accelerate launch.
  • Resources: Requires significant GPU resources for evaluation and potentially for running the models.
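For orientation, here is a rough Python outline of the six demo steps under stated assumptions: the environment variable names come from the setup notes above, but the paths are placeholders, and the loader and processing helpers (load_moai, load_aux_models, moai_processor) are hypothetical stand-ins rather than the repository's actual API.

    # Hypothetical outline of the six-step demo; load_moai, load_aux_models, and the
    # processor interface are illustrative placeholders, not MoAI's real API.
    import os
    from PIL import Image

    # Dataset roots the repo expects (variable names from the setup; paths are placeholders).
    for var in ("DETECTRON2_DATASETS", "DATASET", "DATASET2", "VLDATASET"):
        os.environ[var] = "/path/to/datasets"

    # 1) Load an image and write a prompt.
    image = Image.open("sample.jpg").convert("RGB")
    prompt = "Describe the scene in detail."

    # 2) Load MoAI-7B plus the auxiliary CV models (segmentation, detection, OCR).
    moai_model, moai_processor = load_moai("MoAI-7B")      # hypothetical helper
    seg_model, det_model, ocr_model = load_aux_models()    # hypothetical helper

    # 3) Run the auxiliary models to collect external visual evidence.
    aux = {"seg": seg_model(image), "det": det_model(image), "ocr": ocr_model(image)}

    # 4) Pack the image, prompt, and auxiliary outputs into model inputs.
    inputs = moai_processor(image=image, prompt=prompt, aux=aux)

    # 5) Generate a response conditioned on the mixed features.
    output_ids = moai_model.generate(**inputs, max_new_tokens=256)

    # 6) Decode and print the answer.
    print(moai_processor.decode(output_ids))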

Highlighted Details

  • Achieves state-of-the-art zero-shot performance on numerous vision-language benchmarks, particularly for tasks involving real-world scene understanding.
  • Outperforms both open-source and closed-source LLVMs without increasing model size or requiring additional visual instruction tuning datasets.
  • Integrates auxiliary features from specialized CV models via MoAI-Compressor and MoAI-Mixer, utilizing a Mixture of Experts architecture.
  • Available on Hugging Face Spaces; benchmark score results are shared on Google Drive.

Maintenance & Community

The project accompanies an ECCV 2024 paper, and the primary author is ByungKwanLee. Links to the arXiv paper and Hugging Face are provided.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Training MoAI is not currently supported. The setup involves several manual code modifications and specific checkpoint downloads, indicating a complex integration process. The project relies on external CV models, whose performance can impact MoAI's results.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 4 stars in the last 90 days
