DetGPT by OptimalScale

Vision-language model for object detection via reasoning

created 2 years ago
780 stars

Top 45.7% on sourcepulse

View on GitHub
Project Summary

DetGPT is a multimodal AI system designed for precise object localization within images based on complex, natural language instructions. It targets researchers and developers in computer vision and natural language processing who need to go beyond simple image description to identify specific, contextually relevant objects. The primary benefit is its ability to understand nuanced queries and accurately pinpoint items, even those requiring common-sense reasoning or knowledge of unfamiliar concepts.

How It Works

DetGPT integrates a large language model (LLM) with an open-vocabulary object detector (GroundingDino). The LLM reasons over the user's instruction to identify target objects and their attributes. This reasoning process is then translated into prompts for GroundingDino, which performs the actual visual detection. This approach allows DetGPT to handle complex instructions by leveraging the LLM's understanding and the detector's visual capabilities, enabling it to find objects based on functional descriptions or less common attributes.
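The hand-off between the two stages can be illustrated with a small parsing sketch. The `[targets: ...]` tag format below is a hypothetical stand-in for the LLM's actual output protocol, and the period-separated prompt mirrors the style GroundingDino-family detectors commonly accept:

```python
import re

def extract_detector_prompt(llm_output: str) -> str:
    """Turn the LLM's reasoning output into a detector text prompt.
    The '[targets: ...]' tag is illustrative, not DetGPT's real format."""
    match = re.search(r"\[targets:\s*(.+?)\]", llm_output)
    if not match:
        return ""
    # Open-vocabulary detectors such as GroundingDino typically take a
    # period-separated list of phrases as their text prompt.
    phrases = [p.strip() for p in match.group(1).split(",") if p.strip()]
    return " . ".join(phrases)

reasoning = (
    "The user wants foods that help reduce blood pressure. "
    "Bananas are rich in potassium. [targets: banana, leafy greens]"
)
print(extract_detector_prompt(reasoning))  # banana . leafy greens
```

The point of the split is that the detector never sees the user's instruction; it only sees the concrete object phrases the LLM has already reasoned its way to.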

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n detgpt python=3.9), activate it, and install the package (pip install -e .). GroundingDino is installed separately (python -m pip install -e GroundingDino).
  • Prerequisites: Python 3.9, CUDA, the COCO dataset, pretrained checkpoints (Robin, Vicuna), and the task-tuning dataset (coco_task_annotation.json). LoRA weights must be merged with their base models (e.g., Llama) before use.
  • Demo: Run locally with CUDA_VISIBLE_DEVICES=0,1 python demo_detgpt.py --cfg-path configs/detgpt_eval_13b.yaml. Requires 2 GPUs.
  • Resources: Downloading checkpoints and datasets is necessary. Training requires multiple GPUs (e.g., torchrun --nproc-per-node 8 train.py).
  • Links: Project Website, Demo, Paper.
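The installation steps above can be collected into one shell sketch. The repository URL is an assumption; the README's own commands are authoritative:

```shell
# Sketch of the setup steps described above; repo URL is assumed.
git clone https://github.com/OptimalScale/DetGPT.git
cd DetGPT

# Create and activate the Python 3.9 environment.
conda create -n detgpt python=3.9 -y
conda activate detgpt

# Install DetGPT itself, then GroundingDino separately.
pip install -e .
python -m pip install -e GroundingDino

# Launch the local demo (requires 2 GPUs).
CUDA_VISIBLE_DEVICES=0,1 python demo_detgpt.py --cfg-path configs/detgpt_eval_13b.yaml
```

Checkpoints and datasets still need to be downloaded and placed where the config expects them before the demo will run.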

Highlighted Details

  • Localizes target objects based on LLM reasoning, not just descriptions.
  • Handles complex instructions like "Find blood pressure-reducing foods."
  • Can identify unfamiliar objects based on learned attributes (e.g., potassium-rich fruits).
  • Built upon GroundingDino and MiniGPT-4 (BLIP2, Lavis).

Maintenance & Community

The project was initiated by OptimalScale. The primary release was in May 2023, with updates to tuned weights in June 2023. Further community engagement channels are not explicitly listed in the README.

Licensing & Compatibility

  • License: BSD 3-Clause License.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source applications.

Limitations & Caveats

The README indicates that linear weights for models other than Vicuna-13B-v1.1 will be released later, so model support is currently limited. Setup also requires manually merging LoRA weights with their base models, which adds complexity.
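To make the merging caveat concrete: "merging LoRA weights" means folding a low-rank update into the frozen base matrix so inference no longer needs the adapter. A minimal NumPy sketch of the arithmetic (toy sizes, not DetGPT's actual checkpoints):

```python
import numpy as np

# LoRA stores a low-rank update (B @ A) alongside a frozen base weight W.
# Merging computes W_merged = W + scale * (B @ A) once, offline.
rng = np.random.default_rng(0)
d, r = 6, 2                       # hidden size and LoRA rank (toy values)
W = rng.normal(size=(d, d))       # frozen base weight
A = rng.normal(size=(r, d))       # LoRA down-projection
B = rng.normal(size=(d, r))       # LoRA up-projection
scale = 1.0

W_merged = W + scale * (B @ A)

# By linearity, the merged matrix reproduces base + adapter exactly.
x = rng.normal(size=d)
assert np.allclose(W_merged @ x, W @ x + scale * (B @ (A @ x)))
```

In practice this is done per layer by the merge script (or a library such as PEFT), after which the merged checkpoint is saved and the adapter files can be discarded.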

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 90 days
