Open-vocabulary object detection using LLM supervision
LLMDet is a PyTorch implementation for learning strong open-vocabulary object detectors by leveraging large language models (LLMs) for caption generation. It targets researchers and practitioners in computer vision, offering improved open-vocabulary detection capabilities and enabling the creation of more robust multi-modal models.
How It Works
LLMDet fine-tunes an existing open-vocabulary detector using a custom dataset, GroundingCap-1M, which pairs images with grounding labels and detailed captions. The core innovation lies in using LLMs to generate both region-level and image-level captions, which are then used as auxiliary supervision during training alongside standard grounding losses. This approach enhances the detector's ability to understand and localize objects across a wide range of vocabulary.
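The training objective described above can be sketched as a grounding loss plus a weighted auxiliary caption loss. The sketch below is illustrative only, using NumPy and dummy cross-entropy terms; the function and parameter names (`total_loss`, `lam`) are assumptions, not LLMDet's actual API.

```python
import numpy as np

def softmax_xent(logits, target):
    # Cross-entropy of one prediction (logit vector) against an integer target.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def total_loss(grounding_logits, grounding_target,
               caption_logits, caption_targets, lam=1.0):
    # Standard grounding loss on region-vocabulary logits (stand-in for the
    # detector's usual alignment/localization losses).
    l_ground = softmax_xent(grounding_logits, grounding_target)
    # Auxiliary caption supervision: mean next-token cross-entropy over the
    # tokens of the LLM-generated (region- or image-level) caption.
    l_caption = np.mean([softmax_xent(l, t)
                         for l, t in zip(caption_logits, caption_targets)])
    # lam weights the auxiliary term against the grounding term (assumed knob).
    return l_ground + lam * l_caption
```

In practice both terms would come from the detector and a caption head respectively; the point is only that caption generation enters training as an extra loss term alongside the grounding losses.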
Quick Start & Requirements
- transformers==4.55.0 (install via pip install transformers); transformers==4.37.2 for older setups
- mmcv==2.2.0, mmengine==0.10.5
- numpy (< 1.24), nltk, wandb
- A GPU with CUDA 12.1 is recommended.
- Pretrained mm_grounding_dino checkpoints (Swin-T, Swin-B, Swin-L) and potentially a fine-tuned LLM.
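A minimal environment setup based on the versions listed above might look like the following; exact wheels (especially mmcv) depend on your CUDA/PyTorch combination, so treat this as a sketch rather than a verified install script.

```shell
# Pin the versions named in the requirements list.
pip install transformers==4.55.0 mmengine==0.10.5 nltk wandb
pip install "numpy<1.24"
# mmcv wheels are built per CUDA/PyTorch combination; consult the mmcv
# installation docs if this plain install fails on your setup.
pip install mmcv==2.2.0
```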
Maintenance & Community
The project is actively maintained, with recent updates including a merge into the official transformers library and the release of Hugging Face demos and checkpoints. Links to Hugging Face and ModelScope are provided for models and data.
Licensing & Compatibility
LLMDet is released under the Apache 2.0 license, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The README notes that numpy versions below 1.24 are recommended for compatibility. While the project has been merged into transformers, users relying on older versions might need to manage dependency compatibility carefully.