MoAI by ByungKwanLee

PyTorch code for vision-language model (VLM) research

Created 1 year ago
326 stars
Top 84.8% on sourcepulse

View on GitHub
Project Summary

MoAI is a PyTorch implementation of a novel large language and vision model (LLVM) that improves zero-shot performance on vision-language tasks by integrating auxiliary information from specialized computer vision models. It targets researchers and practitioners seeking to improve LLVM performance on real-world scene understanding without increasing model size or requiring extensive dataset curation.

How It Works

MoAI leverages external computer vision models (segmentation, detection, scene graph generation, OCR) to extract auxiliary visual information. This information is then processed by two new modules: MoAI-Compressor, which aligns and condenses the external model outputs into a fixed number of tokens, and MoAI-Mixer, which blends these compressed features with standard visual and language features using a Mixture of Experts approach. This strategy aims to efficiently incorporate detailed scene understanding while mitigating potential performance degradation from imperfect external models.
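The repository's modules are more involved, but the following minimal PyTorch sketch illustrates the two ideas under stated assumptions: a set of learned queries cross-attends over the variable-length auxiliary features to produce a fixed number of tokens (the compressor role), and a soft token-wise gate blends expert projections of the language, visual, and auxiliary token streams (the mixer role). Class names, dimensions, and the expert count here are illustrative assumptions, not the repository's actual implementation.

    import torch
    import torch.nn as nn

    class CompressorSketch(nn.Module):
        """Condense variable-length auxiliary CV features (segmentation, detection,
        scene graph, OCR) into a fixed number of tokens via learned queries."""
        def __init__(self, dim=1024, num_tokens=64, num_heads=8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_tokens, dim))
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, aux_feats):              # aux_feats: (B, N_aux, dim), N_aux varies
            q = self.queries.unsqueeze(0).expand(aux_feats.size(0), -1, -1)
            compressed, _ = self.cross_attn(q, aux_feats, aux_feats)
            return compressed                      # (B, num_tokens, dim)

    class MixerSketch(nn.Module):
        """Blend language, visual, and compressed auxiliary tokens with a
        soft mixture-of-experts gate applied per token."""
        def __init__(self, dim=1024, num_experts=3):
            super().__init__()
            self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
            self.gate = nn.Linear(dim, num_experts)

        def forward(self, lang_tokens, vis_tokens, aux_tokens):
            x = torch.cat([lang_tokens, vis_tokens, aux_tokens], dim=1)   # (B, T, dim)
            weights = self.gate(x).softmax(dim=-1)                        # (B, T, E)
            expert_out = torch.stack([e(x) for e in self.experts], -1)    # (B, T, dim, E)
            return (expert_out * weights.unsqueeze(2)).sum(-1)            # (B, T, dim)

A real implementation would also need modality-specific projections and integration with the language model's layers; the sketch only shows the fixed-token compression and the gated blend.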

Quick Start & Requirements

  • Installation: Requires conda for environment setup and PyTorch 2.0.1 with CUDA 11.8. Install dependencies via pip install -r assets/requirements/requirements.txt and pip install -r assets/requirements/requirements_custom.txt, including flash-attn.
  • Environment Variables: Set DETECTRON2_DATASETS, DATASET, DATASET2, and VLDATASET to point to your dataset locations.
  • Checkpoints: Download and place a specific Scene Graph Generation checkpoint (psgtr_r50_epoch_60.pth) in moai/sgg/checkpoints.
  • Code Modifications: Requires commenting out specific lines in mmdet/apis/inference.py and modifying others, as well as commenting out a line in mmcv/transforms/processing.py.
  • Demo: A six-step Python script demonstrates loading images, prompts, and models (MoAI-7B, segmentation, object detection, OCR), then processing and generation; a rough outline appears after this list.
  • Evaluation: Bash scripts are provided for evaluating zero-shot performance on various benchmarks using accelerate launch.
  • Resources: Requires significant GPU resources for evaluation and potentially for running the models.
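For orientation, here is a rough Python outline of the six demo steps under stated assumptions: the environment variable names come from the setup notes above, but the paths are placeholders, and the loader and processing helpers (load_moai, load_aux_models, moai_processor) are hypothetical stand-ins rather than the repository's actual API.

    # Hypothetical outline of the six-step demo; load_moai, load_aux_models, and the
    # processor interface are illustrative placeholders, not MoAI's real API.
    import os
    from PIL import Image

    # Dataset roots the repo expects (variable names from the setup; paths are placeholders).
    for var in ("DETECTRON2_DATASETS", "DATASET", "DATASET2", "VLDATASET"):
        os.environ[var] = "/path/to/datasets"

    # 1) Load an image and write a prompt.
    image = Image.open("sample.jpg").convert("RGB")
    prompt = "Describe the scene in detail."

    # 2) Load MoAI-7B plus the auxiliary CV models (segmentation, detection, OCR).
    moai_model, moai_processor = load_moai("MoAI-7B")      # hypothetical helper
    seg_model, det_model, ocr_model = load_aux_models()    # hypothetical helper

    # 3) Run the auxiliary models to collect external visual evidence.
    aux = {"seg": seg_model(image), "det": det_model(image), "ocr": ocr_model(image)}

    # 4) Pack the image, prompt, and auxiliary outputs into model inputs.
    inputs = moai_processor(image=image, prompt=prompt, aux=aux)

    # 5) Generate a response conditioned on the mixed features.
    output_ids = moai_model.generate(**inputs, max_new_tokens=256)

    # 6) Decode and print the answer.
    print(moai_processor.decode(output_ids))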

Highlighted Details

  • Achieves state-of-the-art zero-shot performance on numerous vision-language benchmarks, particularly for tasks involving real-world scene understanding.
  • Outperforms both open-source and closed-source LLVMs without increasing model size or requiring additional visual instruction tuning datasets.
  • Integrates auxiliary features from specialized CV models via MoAI-Compressor and MoAI-Mixer, utilizing a Mixture of Experts architecture.
  • Available on Hugging Face Spaces; benchmark score results are shared on Google Drive.

Maintenance & Community

The project accompanies an ECCV 2024 paper, and the primary author is ByungKwanLee. Links to the arXiv paper and Hugging Face are provided.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Training MoAI is not currently supported. The setup involves several manual code modifications and specific checkpoint downloads, indicating a complex integration process. The project relies on external CV models, whose performance can impact MoAI's results.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 4 stars in the last 90 days
