Instruction-tuning an LLM on regions of interest for visual understanding
Top 59.7% on sourcepulse
GPT4RoI is an instruction-tuned large language model designed for region-of-interest (RoI) understanding in visual question answering. It targets researchers and developers working on multimodal AI, enabling more precise visual reasoning by allowing users to specify and refer to regions within images.
How It Works
GPT4RoI builds on the LLaVA architecture and the Vicuna LLM, incorporating region-level information through instruction tuning. It is trained on a mixture of grounding datasets (RefCOCO, RefCOCO+, RefCOCOg, Visual Genome, Flickr30K Entities) together with the VCR dataset, strengthening its ability to understand and reason about specified image regions. Within a conversation, special tokens such as <region1> are used to reference these regions.
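As an illustration of the token scheme, the sketch below pairs a question containing <regionN> tokens with user-drawn boxes. The helper name, dict layout, and normalized-coordinate convention are assumptions for illustration, not GPT4RoI's actual API; in the model itself, each region token is resolved to RoI features pooled from the image encoder rather than to raw coordinates.

```python
# Illustrative only: a hypothetical helper that pairs a question containing
# <regionN> tokens with user-specified bounding boxes. GPT4RoI itself
# replaces each region token with pooled RoI features, not raw coordinates;
# the dict layout here is an assumption.

def build_region_prompt(question: str,
                        boxes: list[tuple[float, float, float, float]]) -> dict:
    """Map each <regionN> token to its (x1, y1, x2, y2) box, normalized to [0, 1]."""
    regions = {f"<region{i + 1}>": box for i, box in enumerate(boxes)}
    return {"prompt": question, "regions": regions}

example = build_region_prompt(
    "What is the person in <region1> holding, and is it near <region2>?",
    boxes=[
        (0.12, 0.30, 0.45, 0.95),  # region 1
        (0.50, 0.40, 0.80, 0.85),  # region 2
    ],
)
print(sorted(example["regions"]))  # ['<region1>', '<region2>']
```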
Quick Start & Requirements
Create a conda environment (conda create -n gpt4roi python=3.10), install dependencies (pip install -e .), re-install PyTorch with CUDA 11.7 support (conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia), install flash-attn, and build mmcv-1.4.7 with matching CUDA versions. Key requirements: flash-attn, mmcv-1.4.7. Launch the demo with python gpt4roi/app.py.
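For convenience, the steps above as one sequence. Treat this as a sketch: the flash-attn and mmcv lines are placeholder installs, since both typically need to be built against your exact CUDA/PyTorch combination, and the repo's own README is authoritative.

```bash
# Consolidated setup (illustrative; the repo README is authoritative).
conda create -n gpt4roi python=3.10 -y
conda activate gpt4roi

# From the repository root, install the package and its dependencies.
pip install -e .

# Re-install PyTorch built against CUDA 11.7.
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

# flash-attn and mmcv-1.4.7 must match PyTorch's CUDA version; these PyPI
# installs are placeholders -- consult each project's build instructions.
pip install flash-attn
pip install mmcv-full==1.4.7

# Launch the demo.
python gpt4roi/app.py
```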
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The dataset organization section is noted as potentially messy and still under development, with efforts underway to unify the format. Applying the delta weights requires significant CPU RAM.
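For context on the RAM requirement: delta-weight releases (the Vicuna-style scheme) are applied by loading both the base model and the delta into CPU memory and summing them, so peak usage is roughly the two models combined. Below is a minimal sketch of that general pattern with placeholder paths; GPT4RoI ships its own conversion script, so this is illustrative, not the repo's code.

```python
# Illustrative delta-weight merge (generic Vicuna-style pattern; GPT4RoI
# provides its own script). Both models sit in CPU RAM simultaneously,
# hence the large memory requirement. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "path/to/base-llama", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
delta = AutoModelForCausalLM.from_pretrained(
    "path/to/gpt4roi-delta", torch_dtype=torch.float16, low_cpu_mem_usage=True
)

# Recover usable weights: final = base + delta, computed in place on the delta.
base_state = base.state_dict()
for name, param in delta.state_dict().items():
    if name in base_state:  # params that exist only in the delta are kept as-is
        param.data += base_state[name]

delta.save_pretrained("path/to/gpt4roi-merged")
```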