GPT4RoI by jshilong

Instruction tuning LLM on regions-of-interest for visual understanding

created 2 years ago
539 stars

Top 59.7% on sourcepulse

Project Summary

GPT4RoI is an instruction-tuned large language model designed for region-of-interest (RoI) understanding in visual question answering. It targets researchers and developers working on multimodal AI, enabling more precise visual reasoning by allowing users to specify and refer to regions within images.

How It Works

GPT4RoI builds on the LLaVA architecture and the Vicuna LLM, adding region-specific information through instruction tuning. Its training data combines several grounding datasets (RefCOCO, RefCOCO+, RefCOCOg, Visual Genome, Flickr30K Entities) with the VCR dataset, which improves its ability to understand and reason about specified image regions. In a conversation, a region is referenced with a special token such as <region1>.
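
The region-token convention can be illustrated with a small sketch. This is not the project's code: the Region class, the helper names, and the box format (x1, y1, x2, y2 in pixels) are hypothetical, and only the <regionN> token shape comes from the description above. It shows how a prompt might be checked so that every referenced region has a box supplied.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical region record: a pixel box (x1, y1, x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float


def region_token(index: int) -> str:
    """Return the placeholder token for the index-th region, counting from 1."""
    return f"<region{index}>"


def find_region_refs(prompt: str) -> list[int]:
    """Return the region numbers referenced in a prompt, in order of appearance."""
    return [int(n) for n in re.findall(r"<region(\d+)>", prompt)]


def check_prompt(prompt: str, regions: list[Region]) -> None:
    """Raise ValueError if the prompt refers to a region that was not supplied."""
    for n in find_region_refs(prompt):
        if not 1 <= n <= len(regions):
            raise ValueError(f"prompt references {region_token(n)} "
                             f"but only {len(regions)} region(s) were given")


# Two illustrative boxes and a question that refers to both of them.
regions = [Region(10, 20, 110, 220), Region(150, 40, 300, 260)]
prompt = (f"What is the relationship between {region_token(1)} "
          f"and {region_token(2)}?")
check_prompt(prompt, regions)
```

Validating references before inference is a design choice of this sketch; how the model itself maps each token to region features is not covered here.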

Quick Start & Requirements

  • Install:
      1. Clone the repository.
      2. Create a conda environment: conda create -n gpt4roi python=3.10
      3. Install dependencies: pip install -e .
      4. Re-install PyTorch with CUDA 11.7 support: conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
      5. Install flash-attn.
      6. Build mmcv-1.4.7 against the matching CUDA version.
  • Prerequisites: Python 3.10, CUDA 11.7, PyTorch, flash-attn, mmcv-1.4.7.
  • Weights: Download the original LLaMA-7B weights and apply the GPT4RoI-7B delta weights to them; the conversion needs approximately 30 GB of CPU RAM.
  • Data: Download the datasets and organize them according to the structure the project specifies.
  • Demo: A Gradio demo is available via python gpt4roi/app.py.
  • Docs: Demo, Paper
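
Before launching the demo, it can help to confirm that the prerequisites listed above are importable. The following is a minimal sketch, not part of the project: the helper names are hypothetical, and it only checks that the modules can be found and that the interpreter is Python 3.10. It does not verify package versions or the CUDA build.

```python
import importlib.util
import sys

# Import names of the packages listed in the prerequisites.
REQUIRED_MODULES = ("torch", "flash_attn", "mmcv")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be located on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_matches(major=3, minor=10):
    """Return True if the running interpreter is the expected Python version."""
    return sys.version_info[:2] == (major, minor)


def environment_report():
    """Build a short human-readable summary of the checks above."""
    lines = []
    if not python_matches():
        found = ".".join(str(part) for part in sys.version_info[:2])
        lines.append(f"Python 3.10 expected, found {found}")
    for name in missing_modules():
        lines.append(f"module not found: {name}")
    return lines or ["all checks passed"]


print("\n".join(environment_report()))
```

Using importlib.util.find_spec avoids importing heavy packages just to test for their presence.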

Highlighted Details

  • Instruction tuning on a curated dataset of multiple grounding and VCR datasets.
  • Supports single and multiple region understanding with conversational referencing.
  • Delta weights available for GPT4RoI-7B, requiring combination with LLaMA-7B.
  • Training code provided for stage 1 (Vicuna-based) and stage 2.
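
The idea behind combining delta weights with a base model can be sketched in a few lines. This is an illustration only, not the project's conversion script. It assumes that each delta tensor is the elementwise difference between the fine-tuned and base parameters, as is conventional for Vicuna-style releases, and that parameters present only in the delta (for example, newly added modules) are kept unchanged. Plain lists stand in for tensors, and the function name is hypothetical.

```python
def apply_delta(base, delta):
    """Merge a base state dict with a delta state dict.

    Both arguments map parameter names to flat lists of floats.
    Assumption: merged = base + delta for shared names; names that
    exist only in the delta are copied through as-is.
    """
    merged = {}
    for name, delta_values in delta.items():
        if name not in base:
            # Parameter introduced by fine-tuning: nothing to add.
            merged[name] = list(delta_values)
            continue
        base_values = base[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


# Toy example with one shared parameter and one new parameter.
base = {"layer.weight": [1.0, 2.0]}
delta = {"layer.weight": [0.5, -1.0], "region.proj": [3.0]}
merged = apply_delta(base, delta)
```

A real checkpoint may also need special handling where fine-tuning changed a tensor's shape (for instance, an enlarged embedding table); this sketch simply raises an error in that case.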

Maintenance & Community

  • Codebase built upon LLaVA.
  • Updates include the release of GPT4RoI-7B-delta-V0 and of the full code.
  • Project acknowledges LLaVA, Vicuna, and VCR dataset creators.

Licensing & Compatibility

  • Relies on LLaMA weights, which have licensing restrictions.
  • Delta weights are provided separately. Compatibility with commercial or closed-source applications depends on LLaMA's license.

Limitations & Caveats

The project describes its dataset organization as potentially messy and still under development, with work underway to unify the format. Applying the delta weights also requires significant CPU RAM (approximately 30 GB).
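
A back-of-the-envelope estimate suggests why the conversion is memory-hungry. The sketch below is an assumption-laden illustration, not a measurement: it assumes a nominal 7 billion parameters, 2 bytes per parameter (16-bit storage), and that the base and delta weights are both held in memory at once. The function names are hypothetical.

```python
def weights_gb(num_params, bytes_per_param):
    """Return the size of one set of weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1_000_000_000


def peak_conversion_gb(num_params, bytes_per_param, copies=2):
    """Estimate peak memory when several full copies are resident at once."""
    return copies * weights_gb(num_params, bytes_per_param)


# Assumed figures: 7B parameters stored at 2 bytes each.
one_copy = weights_gb(7_000_000_000, 2)           # 14.0 GB
both_copies = peak_conversion_gb(7_000_000_000, 2)  # 28.0 GB
```

Under these assumptions the two copies alone come to about 28 GB, which is in line with the roughly 30 GB figure once some working overhead is allowed for.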

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 12 stars in the last 90 days
