Open-vocabulary object detection using LLM supervision
LLMDet is a PyTorch implementation for learning strong open-vocabulary object detectors by leveraging large language models (LLMs) for caption generation. It targets researchers and practitioners in computer vision, offering improved open-vocabulary detection capabilities and enabling the creation of more robust multi-modal models.
How It Works
LLMDet fine-tunes an existing open-vocabulary detector using a custom dataset, GroundingCap-1M, which pairs images with grounding labels and detailed captions. The core innovation lies in using LLMs to generate both region-level and image-level captions, which are then used as auxiliary supervision during training alongside standard grounding losses. This approach enhances the detector's ability to understand and localize objects across a wide range of vocabulary.
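The training objective described above can be sketched as a grounding loss plus a weighted auxiliary caption loss. The sketch below is illustrative only, using NumPy and dummy cross-entropy terms; the function and parameter names (`total_loss`, `lam`) are assumptions, not LLMDet's actual API.

```python
import numpy as np

def softmax_xent(logits, target):
    # Cross-entropy of one prediction (logit vector) against an integer target.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def total_loss(grounding_logits, grounding_target,
               caption_logits, caption_targets, lam=1.0):
    # Standard grounding loss on region-vocabulary logits (stand-in for the
    # detector's usual alignment/localization losses).
    l_ground = softmax_xent(grounding_logits, grounding_target)
    # Auxiliary caption supervision: mean next-token cross-entropy over the
    # tokens of the LLM-generated (region- or image-level) caption.
    l_caption = np.mean([softmax_xent(l, t)
                         for l, t in zip(caption_logits, caption_targets)])
    # lam weights the auxiliary term against the grounding term (assumed knob).
    return l_ground + lam * l_caption
```

In practice both terms would come from the detector and a caption head respectively; the point is only that caption generation enters training as an extra loss term alongside the grounding losses.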
Quick Start & Requirements
- transformers==4.55.0 (install via pip install transformers); transformers==4.37.2 for older setups
- mmcv==2.2.0, mmengine==0.10.5
- numpy (< 1.24), nltk, wandb
- A GPU with CUDA 12.1 is recommended.
- Pretrained mm_grounding_dino checkpoints (Swin-T, Swin-B, Swin-L) and potentially a fine-tuned LLM.
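A minimal environment setup based on the versions listed above might look like the following; exact wheels (especially mmcv) depend on your CUDA/PyTorch combination, so treat this as a sketch rather than a verified install script.

```shell
# Pin the versions named in the requirements list.
pip install transformers==4.55.0 mmengine==0.10.5 nltk wandb
pip install "numpy<1.24"
# mmcv wheels are built per CUDA/PyTorch combination; consult the mmcv
# installation docs if this plain install fails on your setup.
pip install mmcv==2.2.0
```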
Maintenance & Community
The project is actively maintained, with recent updates including a merge into the official transformers library and the release of Hugging Face demos and checkpoints. Links to Hugging Face and ModelScope are provided for models and data.
Licensing & Compatibility
LLMDet is released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The README notes that numpy versions below 1.24 are recommended for compatibility. While the project has been merged into transformers, users relying on older versions might need to manage dependency compatibility carefully.