Instruction-tuning an LLM on regions of interest for visual understanding
Top 59.7% on sourcepulse
GPT4RoI is an instruction-tuned large language model designed for region-of-interest (RoI) understanding in visual question answering. It targets researchers and developers working on multimodal AI, enabling more precise visual reasoning by allowing users to specify and refer to regions within images.
How It Works
GPT4RoI builds on the LLaVA architecture and the Vicuna LLM, incorporating region-level information through instruction tuning. It is trained on a mixture of grounding datasets (RefCOCO, RefCOCO+, RefCOCOg, Visual Genome, Flickr30K Entities) together with the VCR dataset, strengthening its ability to understand and reason about specified image regions. Within a conversation, special tokens such as <region1> are used to reference these regions.
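As an illustration of the token scheme, the sketch below pairs a question containing <regionN> tokens with user-drawn boxes. The helper name, dict layout, and normalized-coordinate convention are assumptions for illustration, not GPT4RoI's actual API; in the model itself, each region token is resolved to RoI features pooled from the image encoder rather than to raw coordinates.

```python
# Illustrative only: a hypothetical helper that pairs a question containing
# <regionN> tokens with user-specified bounding boxes. GPT4RoI itself
# replaces each region token with pooled RoI features, not raw coordinates;
# the dict layout here is an assumption.

def build_region_prompt(question: str,
                        boxes: list[tuple[float, float, float, float]]) -> dict:
    """Map each <regionN> token to its (x1, y1, x2, y2) box, normalized to [0, 1]."""
    regions = {f"<region{i + 1}>": box for i, box in enumerate(boxes)}
    return {"prompt": question, "regions": regions}

example = build_region_prompt(
    "What is the person in <region1> holding, and is it near <region2>?",
    boxes=[
        (0.12, 0.30, 0.45, 0.95),  # region 1
        (0.50, 0.40, 0.80, 0.85),  # region 2
    ],
)
print(sorted(example["regions"]))  # ['<region1>', '<region2>']
```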
Quick Start & Requirements
Create a conda environment (conda create -n gpt4roi python=3.10), install dependencies (pip install -e .), re-install PyTorch with CUDA 11.7 support (conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia), install flash-attn, and build mmcv-1.4.7 with matching CUDA versions. Key requirements: flash-attn, mmcv-1.4.7. Launch the demo with python gpt4roi/app.py.
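For convenience, the steps above as one sequence. Treat this as a sketch: the flash-attn and mmcv lines are placeholder installs, since both typically need to be built against your exact CUDA/PyTorch combination, and the repo's own README is authoritative.

```bash
# Consolidated setup (illustrative; the repo README is authoritative).
conda create -n gpt4roi python=3.10 -y
conda activate gpt4roi

# From the repository root, install the package and its dependencies.
pip install -e .

# Re-install PyTorch built against CUDA 11.7.
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

# flash-attn and mmcv-1.4.7 must match PyTorch's CUDA version; these PyPI
# installs are placeholders -- consult each project's build instructions.
pip install flash-attn
pip install mmcv-full==1.4.7

# Launch the demo.
python gpt4roi/app.py
```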
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The dataset organization section is noted as potentially messy and still under development, with efforts underway to unify the format. Applying the delta weights requires significant CPU RAM.
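For context on the RAM requirement: delta-weight releases (the Vicuna-style scheme) are applied by loading both the base model and the delta into CPU memory and summing them, so peak usage is roughly the two models combined. Below is a minimal sketch of that general pattern with placeholder paths; GPT4RoI ships its own conversion script, so this is illustrative, not the repo's code.

```python
# Illustrative delta-weight merge (generic Vicuna-style pattern; GPT4RoI
# provides its own script). Both models sit in CPU RAM simultaneously,
# hence the large memory requirement. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "path/to/base-llama", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
delta = AutoModelForCausalLM.from_pretrained(
    "path/to/gpt4roi-delta", torch_dtype=torch.float16, low_cpu_mem_usage=True
)

# Recover usable weights: final = base + delta, computed in place on the delta.
base_state = base.state_dict()
for name, param in delta.state_dict().items():
    if name in base_state:  # params that exist only in the delta are kept as-is
        param.data += base_state[name]

delta.save_pretrained("path/to/gpt4roi-merged")
```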