PyTorch code for vision-language model (VLM) research
MoAI is a PyTorch implementation of a large language and vision model (LLVM) that improves zero-shot vision-language task performance by integrating auxiliary information from specialized computer vision models. It targets researchers and practitioners who want better LLVM performance on real-world scene understanding without increasing model size or curating extensive datasets.
How It Works
MoAI leverages external computer vision models (segmentation, detection, scene graph generation, OCR) to extract auxiliary visual information. This information is then processed by two new modules: MoAI-Compressor, which aligns and condenses the external model outputs into a fixed number of tokens, and MoAI-Mixer, which blends these compressed features with standard visual and language features using a Mixture of Experts approach. This strategy aims to efficiently incorporate detailed scene understanding while mitigating potential performance degradation from imperfect external models.
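The two ideas described above, condensing a variable amount of auxiliary information into a fixed number of tokens and blending feature sources with gate weights, can be illustrated with a toy pure-Python sketch. This is not MoAI's implementation: the attention-style pooling in compress, the softmax gating in mix, and all names and sizes (DIM, NUM_QUERIES, the random vectors, the gate logits) are assumptions chosen for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def compress(aux_tokens, queries):
    """Condense any number of auxiliary tokens into len(queries) tokens.

    Each query attends over all auxiliary tokens and returns their
    attention-weighted average (an assumed stand-in for a learned compressor).
    """
    dim = len(aux_tokens[0])
    compressed = []
    for q in queries:
        weights = softmax([dot(q, t) for t in aux_tokens])
        compressed.append(
            [sum(w * t[d] for w, t in zip(weights, aux_tokens)) for d in range(dim)]
        )
    return compressed


def mix(expert_outputs, gate_logits):
    """Blend same-shaped expert outputs using softmax gate weights."""
    gates = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [sum(g * out[d] for g, out in zip(gates, expert_outputs)) for d in range(dim)]


random.seed(0)
DIM, NUM_QUERIES = 4, 3  # hypothetical sizes
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

# Two scenes with different numbers of auxiliary tokens (e.g. detected objects).
small_scene = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
large_scene = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(40)]

# Both scenes end up with exactly NUM_QUERIES tokens.
compressed_small = compress(small_scene, queries)
compressed_large = compress(large_scene, queries)

# Three hypothetical feature sources for one token: auxiliary, visual, language.
visual_feat = [random.gauss(0, 1) for _ in range(DIM)]
language_feat = [random.gauss(0, 1) for _ in range(DIM)]
mixed = mix([compressed_small[0], visual_feat, language_feat], gate_logits=[0.2, 1.0, -0.5])
```

Because the gate weights sum to one, a source that carries unreliable information can in principle be down-weighted rather than dropped outright, which is one way to read the goal of limiting degradation from imperfect external models.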
Quick Start & Requirements
Use conda for environment setup, with PyTorch 2.0.1 and CUDA 11.8. Install dependencies via pip install -r assets/requirements/requirements.txt and pip install -r assets/requirements/requirements_custom.txt, including flash-attn.
Set the environment variables DETECTRON2_DATASETS, DATASET, DATASET2, and VLDATASET to point to your dataset locations.
Download the required checkpoint (psgtr_r50_epoch_60.pth) and place it in moai/sgg/checkpoints.
Manual code modifications are needed, involving mmdet/apis/inference.py and other files, as well as commenting out a line in mmcv/transforms/processing.py.
Scripts are run with accelerate launch.
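Since the setup depends on several environment variables and a checkpoint file, a small preflight check can catch missing pieces before launching anything. The sketch below is a hypothetical helper and not part of the MoAI repository: the function name preflight and its parameters are invented here, and it assumes each variable should point to an existing directory. The variable names and the checkpoint path are the ones listed in the setup notes.

```python
import os
from pathlib import Path

# Names taken from the setup notes; everything else here is illustrative.
REQUIRED_ENV_VARS = ("DETECTRON2_DATASETS", "DATASET", "DATASET2", "VLDATASET")
SGG_CHECKPOINT = Path("moai/sgg/checkpoints") / "psgtr_r50_epoch_60.pth"


def preflight(env=None, repo_root="."):
    """Return a list of human-readable setup problems (empty if none found).

    Hypothetical helper: assumes each variable names an existing directory
    and that the checkpoint path is relative to the repository root.
    """
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_ENV_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"environment variable {name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing directory: {value}")
    checkpoint = Path(repo_root) / SGG_CHECKPOINT
    if not checkpoint.is_file():
        problems.append(f"checkpoint not found: {checkpoint}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("setup issue:", problem)
```

Run from the repository root, the script prints one line per problem and nothing if the environment looks complete. It does not verify the manual mmdet and mmcv code modifications, which still need to be checked by hand.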
Highlighted Details
Maintenance & Community
The project is associated with ECCV 2024, and the primary author is ByungKwanLee. Links to arXiv, Hugging Face, and possibly other community channels are provided.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Training MoAI is not currently supported. The setup involves several manual code modifications and specific checkpoint downloads, indicating a complex integration process. The project relies on external CV models, whose performance can impact MoAI's results.