Vision-language model for object detection via reasoning
Top 45.7% on sourcepulse
DetGPT is a multimodal AI system designed for precise object localization within images based on complex, natural language instructions. It targets researchers and developers in computer vision and natural language processing who need to go beyond simple image description to identify specific, contextually relevant objects. The primary benefit is its ability to understand nuanced queries and accurately pinpoint items, even those requiring common-sense reasoning or knowledge of unfamiliar concepts.
How It Works
DetGPT integrates a large language model (LLM) with an open-vocabulary object detector (GroundingDino). The LLM reasons over the user's instruction to identify target objects and their attributes. This reasoning process is then translated into prompts for GroundingDino, which performs the actual visual detection. This approach allows DetGPT to handle complex instructions by leveraging the LLM's understanding and the detector's visual capabilities, enabling it to find objects based on functional descriptions or less common attributes.
Quick Start & Requirements
conda create -n detgpt python=3.9
), activate it, and install the package (pip install -e .
). GroundingDino is installed separately (python -m pip install -e GroundingDino
).coco_task_annotation.json
). Merging LORA weights with base models (e.g., Llama) is required.CUDA_VISIBLE_DEVICES=0,1 python demo_detgpt.py --cfg-path configs/detgpt_eval_13b.yaml
. Requires 2 GPUs.torchrun --nproc-per-node 8 train.py
).Highlighted Details
Maintenance & Community
The project was initiated by OptimalScale. The primary release was in May 2023, with updates to tuned weights in June 2023. Further community engagement channels are not explicitly listed in the README.
Licensing & Compatibility
Limitations & Caveats
The README indicates that linear weights for models other than Vicuna-13B-v1.1 will be released later, suggesting current limited model support. The setup requires manual merging of LORA weights with base models, which can be complex.
1 year ago
1 week