LLMDet by iSEE-Laboratory

Open-vocabulary object detection using LLM supervision

created 6 months ago
362 stars

Top 77.3% on SourcePulse

Project Summary

LLMDet is a PyTorch implementation for learning strong open-vocabulary object detectors by leveraging large language models (LLMs) for caption generation. It targets researchers and practitioners in computer vision, offering improved open-vocabulary detection capabilities and enabling the creation of more robust multi-modal models.

How It Works

LLMDet fine-tunes an existing open-vocabulary detector using a custom dataset, GroundingCap-1M, which pairs images with grounding labels and detailed captions. The core innovation lies in using LLMs to generate both region-level and image-level captions, which are then used as auxiliary supervision during training alongside standard grounding losses. This approach enhances the detector's ability to understand and localize objects across a wide range of vocabulary.
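The objective described above can be sketched as a grounding loss plus two auxiliary caption losses. This is a minimal illustration under stated assumptions, not the authors' implementation: the `GroundingCapSample` record layout, the `combined_loss` helper, and the loss weights are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class GroundingCapSample:
    """Hypothetical layout for one training sample (illustrative only)."""
    image_path: str
    # Each grounding label pairs a phrase with a box (x_min, y_min, x_max, y_max).
    grounding_labels: list = field(default_factory=list)
    # Detailed image-level caption paired with the image.
    image_caption: str = ""


def combined_loss(grounding_loss, image_caption_loss, region_caption_loss,
                  image_weight=1.0, region_weight=1.0):
    """Add the two caption losses to the grounding loss as auxiliary terms.

    The weights are assumptions for illustration; the source does not state them.
    """
    return (grounding_loss
            + image_weight * image_caption_loss
            + region_weight * region_caption_loss)


# Example values are made up purely to show the call shape.
sample = GroundingCapSample(
    image_path="example.jpg",
    grounding_labels=[("a dog", (10, 20, 110, 220))],
    image_caption="A dog sits on a wooden porch.",
)
total = combined_loss(grounding_loss=0.5,
                      image_caption_loss=0.25,
                      region_caption_loss=0.25)
```

In a real training loop each term would be a tensor produced by the detector and the LLM captioning head; plain floats are used here only to keep the sketch dependency-free.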

Quick Start & Requirements

  • Installation: LLMDet has been merged into transformers (release 4.55.0). Install with pip install transformers.
  • Prerequisites: PyTorch (>= 2.2.1+cu121 recommended), transformers==4.37.2 (when running the older, pre-merge code), mmcv==2.2.0, mmengine==0.10.5, numpy (< 1.24), nltk, wandb. A GPU with CUDA 12.1 is recommended.
  • Demo: A Gradio demo is available on Hugging Face.
  • Resources: Requires downloading pre-trained checkpoints for mm_grounding_dino (Swin-T, Swin-B, Swin-L) and potentially a fine-tuned LLM.
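Because the version pins listed above are easy to get wrong, a small script can compare them against the local environment. The sketch below uses only the standard library; the helper names (`parse_version`, `satisfies`, `check_environment`) and the `LEGACY_REQUIREMENTS` table are hypothetical, and the simplified version parser ignores pre-release tags, so treat it as a quick sanity check rather than a full resolver.

```python
import re
from importlib import metadata

# Pins copied from the prerequisites list; the table name is an assumption.
LEGACY_REQUIREMENTS = [
    ("torch", ">=", "2.2.1"),
    ("transformers", "==", "4.37.2"),
    ("mmcv", "==", "2.2.0"),
    ("mmengine", "==", "0.10.5"),
    ("numpy", "<", "1.24"),
]


def parse_version(text):
    """Turn '2.2.1+cu121' into (2, 2, 1); anything after '+' is ignored."""
    public = text.split("+", 1)[0]
    match = re.match(r"\d+(?:\.\d+)*", public)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def satisfies(installed, op, bound):
    """Compare two versions after padding them to the same length."""
    a, b = parse_version(installed), parse_version(bound)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    if op == "<":
        return a < b
    if op == ">=":
        return a >= b
    if op == "==":
        return a == b
    raise ValueError(f"unsupported operator: {op!r}")


def check_environment(requirements=LEGACY_REQUIREMENTS):
    """Return a {package: status} report for the given requirement table."""
    report = {}
    for name, op, bound in requirements:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        if satisfies(installed, op, bound):
            report[name] = f"{installed} (ok)"
        else:
            report[name] = f"{installed} (expected {op} {bound})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

Padding both tuples before comparing keeps `1.24` and `1.24.0` equal, which matters for the numpy bound.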

Highlighted Details

  • Selected as a highlight paper at CVPR 2025.
  • Outperforms the baseline detector in open-vocabulary ability.
  • Demonstrates mutual benefits: the improved detector can in turn be used to build stronger multi-modal models.
  • Offers pre-trained checkpoints on Hugging Face and ModelScope.

Maintenance & Community

The project is actively maintained, with recent updates including merging into the official transformers library and the release of Hugging Face demos and checkpoints. Links to Hugging Face and ModelScope are provided for models and data.

Licensing & Compatibility

LLMDet is released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The README recommends numpy versions below 1.24 for compatibility. Although the project is merged into transformers, users who stay on the older pinned setup (for example, transformers==4.37.2) may need to manage dependency compatibility carefully.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 8

Star History

  • 82 stars in the last 30 days
