ContextDET by yuhangzang

Contextual object detection powered by multimodal large language models

Created 2 years ago

257 stars

Top 98.3% on SourcePulse

Project Summary

ContextDET introduces contextual object detection, a task that addresses the lack of essential perception abilities in Multimodal Large Language Models (MLLMs). It enables understanding of visible objects within diverse human-AI interactive contexts, such as language cloze tests, visual captioning, and question answering. This benefits researchers and developers who want to give MLLMs object recognition that is not limited to fixed class labels.

How It Works

The project employs a "generate-then-detect" framework with three components: a visual encoder that extracts image representations, a pre-trained LLM that decodes multimodal contextual tokens given a task-specific prefix, and a visual decoder that predicts bounding boxes and scores for conditional queries linked to contextual object words. This architecture can detect objects corresponding to words from the general human vocabulary, whereas traditional object detectors are limited to a pre-defined set of classes.
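The data flow described above can be illustrated with a toy sketch. This is not ContextDET's code: every function, class, and value below (`encode_image`, `generate_context`, `decode_boxes`, `contextual_detect`, the fixed word "dog", the fixed box) is a hypothetical stand-in chosen only to show the order of the three stages, and the stubs do no real inference.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# (x_min, y_min, x_max, y_max), normalised to [0, 1]. Hypothetical convention.
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    word: str    # contextual object word produced by the language stage
    box: Box     # predicted bounding box for that word
    score: float # confidence score for the box


def encode_image(image: Sequence[Sequence[float]]) -> List[float]:
    """Stand-in visual encoder: one mean-intensity feature per image row."""
    return [sum(row) / len(row) for row in image]


def generate_context(
    features: List[float], prefix: str, prompt: str
) -> Tuple[List[str], List[str]]:
    """Stand-in LLM stage ("generate").

    A real model would decode tokens conditioned on the visual features and
    the task prefix. This stub ignores both and fills every [MASK] slot with
    a fixed placeholder word.
    """
    tokens: List[str] = []
    object_words: List[str] = []
    for token in prompt.split():
        if token == "[MASK]":
            token = "dog"  # placeholder prediction, not a model output
            object_words.append(token)
        tokens.append(token)
    return tokens, object_words


def decode_boxes(features: List[float], object_words: List[str]) -> List[Detection]:
    """Stand-in visual decoder ("detect"): one fixed box per object word."""
    # Toy score: mean feature value clipped to [0, 1].
    score = max(0.0, min(1.0, sum(features) / len(features)))
    return [Detection(word, (0.25, 0.25, 0.75, 0.75), score) for word in object_words]


def contextual_detect(
    image: Sequence[Sequence[float]], prompt: str, prefix: str = "cloze"
) -> Tuple[str, List[Detection]]:
    """Run the three stages in order: encode, generate, then detect."""
    features = encode_image(image)
    tokens, object_words = generate_context(features, prefix, prompt)
    return " ".join(tokens), decode_boxes(features, object_words)


if __name__ == "__main__":
    toy_image = [[0.2, 0.4], [0.6, 0.8]]
    text, detections = contextual_detect(toy_image, "a [MASK] on the grass")
    print(text)
    for detection in detections:
        print(detection)
```

The point of the sketch is the ordering: language generation happens first, and boxes are predicted only for the object words that generation produced, which is what lets the detected vocabulary follow the language model rather than a fixed class list.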

Quick Start & Requirements

Installation: pip install -r requirements.txt
Prerequisites: Python packages (detailed in requirements.txt), a checkpoint file (download required).
Execution: Run python app.py after setup.
Demo: Available on HuggingFace Spaces.
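The setup steps above can be wrapped in a small preflight helper. The following is a minimal sketch, assuming the repository root contains requirements.txt and app.py as the commands above imply. The function names `missing_prerequisites` and `launch` are hypothetical, and the checkpoint location is passed in by the caller because its file name and expected path are not stated here.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def missing_prerequisites(repo_dir: Path, checkpoint: Path) -> List[str]:
    """Return a human-readable list of setup items that are not in place."""
    problems: List[str] = []
    if not (repo_dir / "requirements.txt").is_file():
        problems.append(f"requirements.txt not found in {repo_dir}")
    if not (repo_dir / "app.py").is_file():
        problems.append(f"app.py not found in {repo_dir}")
    # The checkpoint path is an assumption supplied by the caller.
    if not checkpoint.is_file():
        problems.append(f"checkpoint file not found at {checkpoint}")
    return problems


def launch(repo_dir: Path, checkpoint: Path) -> int:
    """Install dependencies and start the demo app if all prerequisites exist."""
    problems = missing_prerequisites(repo_dir, checkpoint)
    if problems:
        for problem in problems:
            print(f"setup incomplete: {problem}", file=sys.stderr)
        return 1
    # Equivalent to: pip install -r requirements.txt
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"],
        cwd=repo_dir,
        check=True,
    )
    # Equivalent to: python app.py
    return subprocess.run([sys.executable, "app.py"], cwd=repo_dir).returncode
```

A call such as `launch(Path("."), Path("path/to/checkpoint"))` would run from the repository root. The checkpoint argument is a placeholder to replace with wherever the downloaded file was saved.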

Highlighted Details

Defines and addresses the novel research problem of "contextual object detection."
Enables detection of objects using words from the general human vocabulary, expanding beyond pre-defined class sets.
Provides a HuggingFace demo and model checkpoint for immediate experimentation.
Includes the CODE dataset for evaluation purposes.

Maintenance & Community

The project acknowledges contributions from several public codebases, including DETR, Deformable DETR, DETA, OV DETR, and BLIP2. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility

Licensed under "S-Lab License 1.0". Redistribution and use are permitted for non-commercial purposes only.

Limitations & Caveats

Training scripts are currently unavailable; the README notes them as "waiting to be cleaned up." The non-commercial terms of the "S-Lab License 1.0" are a significant adoption blocker for commercial products.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive

Star History

1 star in the last 30 days