DetGPT by OptimalScale

Vision-language model for object detection via reasoning

created 2 years ago
780 stars

Top 45.7% on sourcepulse

View on GitHub
Project Summary

DetGPT is a multimodal AI system designed for precise object localization within images based on complex, natural language instructions. It targets researchers and developers in computer vision and natural language processing who need to go beyond simple image description to identify specific, contextually relevant objects. The primary benefit is its ability to understand nuanced queries and accurately pinpoint items, even those requiring common-sense reasoning or knowledge of unfamiliar concepts.

How It Works

DetGPT integrates a large language model (LLM) with an open-vocabulary object detector (GroundingDino). The LLM reasons over the user's instruction to identify target objects and their attributes. This reasoning process is then translated into prompts for GroundingDino, which performs the actual visual detection. This approach allows DetGPT to handle complex instructions by leveraging the LLM's understanding and the detector's visual capabilities, enabling it to find objects based on functional descriptions or less common attributes.
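The hand-off between the two stages can be illustrated with a small parsing sketch. The `[targets: ...]` tag format below is a hypothetical stand-in for the LLM's actual output protocol, and the period-separated prompt mirrors the style GroundingDino-family detectors commonly accept:

```python
import re

def extract_detector_prompt(llm_output: str) -> str:
    """Turn the LLM's reasoning output into a detector text prompt.
    The '[targets: ...]' tag is illustrative, not DetGPT's real format."""
    match = re.search(r"\[targets:\s*(.+?)\]", llm_output)
    if not match:
        return ""
    # Open-vocabulary detectors such as GroundingDino typically take a
    # period-separated list of phrases as their text prompt.
    phrases = [p.strip() for p in match.group(1).split(",") if p.strip()]
    return " . ".join(phrases)

reasoning = (
    "The user wants foods that help reduce blood pressure. "
    "Bananas are rich in potassium. [targets: banana, leafy greens]"
)
print(extract_detector_prompt(reasoning))  # banana . leafy greens
```

The point of the split is that the detector never sees the user's instruction; it only sees the concrete object phrases the LLM has already reasoned its way to.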

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n detgpt python=3.9), activate it, and install the package (pip install -e .). GroundingDino is installed separately (python -m pip install -e GroundingDino).
  • Prerequisites: Python 3.9, CUDA, the COCO dataset, pretrained checkpoints (Robin, Vicuna), and the task-tuning dataset (coco_task_annotation.json). LoRA weights must be merged with their base models (e.g., Llama) before use.
  • Demo: Run locally with CUDA_VISIBLE_DEVICES=0,1 python demo_detgpt.py --cfg-path configs/detgpt_eval_13b.yaml. Requires 2 GPUs.
  • Resources: Downloading checkpoints and datasets is necessary. Training requires multiple GPUs (e.g., torchrun --nproc-per-node 8 train.py).
  • Links: Project Website, Demo, Paper.
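The installation steps above can be collected into one shell sketch. The repository URL is an assumption; the README's own commands are authoritative:

```shell
# Sketch of the setup steps described above; repo URL is assumed.
git clone https://github.com/OptimalScale/DetGPT.git
cd DetGPT

# Create and activate the Python 3.9 environment.
conda create -n detgpt python=3.9 -y
conda activate detgpt

# Install DetGPT itself, then GroundingDino separately.
pip install -e .
python -m pip install -e GroundingDino

# Launch the local demo (requires 2 GPUs).
CUDA_VISIBLE_DEVICES=0,1 python demo_detgpt.py --cfg-path configs/detgpt_eval_13b.yaml
```

Checkpoints and datasets still need to be downloaded and placed where the config expects them before the demo will run.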

Highlighted Details

  • Localizes target objects based on LLM reasoning, not just descriptions.
  • Handles complex instructions like "Find blood pressure-reducing foods."
  • Can identify unfamiliar objects based on learned attributes (e.g., potassium-rich fruits).
  • Built upon GroundingDino and MiniGPT-4 (BLIP2, Lavis).

Maintenance & Community

The project was initiated by OptimalScale. The primary release was in May 2023, with updates to tuned weights in June 2023. Further community engagement channels are not explicitly listed in the README.

Licensing & Compatibility

  • License: BSD 3-Clause License.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source applications.

Limitations & Caveats

The README indicates that linear weights for models other than Vicuna-13B-v1.1 will be released later, so model support is currently limited. Setup also requires manually merging LoRA weights with their base models, which adds complexity.
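To make the merging caveat concrete: "merging LoRA weights" means folding a low-rank update into the frozen base matrix so inference no longer needs the adapter. A minimal NumPy sketch of the arithmetic (toy sizes, not DetGPT's actual checkpoints):

```python
import numpy as np

# LoRA stores a low-rank update (B @ A) alongside a frozen base weight W.
# Merging computes W_merged = W + scale * (B @ A) once, offline.
rng = np.random.default_rng(0)
d, r = 6, 2                       # hidden size and LoRA rank (toy values)
W = rng.normal(size=(d, d))       # frozen base weight
A = rng.normal(size=(r, d))       # LoRA down-projection
B = rng.normal(size=(d, r))       # LoRA up-projection
scale = 1.0

W_merged = W + scale * (B @ A)

# By linearity, the merged matrix reproduces base + adapter exactly.
x = rng.normal(size=d)
assert np.allclose(W_merged @ x, W @ x + scale * (B @ (A @ x)))
```

In practice this is done per layer by the merge script (or a library such as PEFT), after which the merged checkpoint is saved and the adapter files can be discarded.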

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 90 days
