ContextDET by yuhangzang

Contextual object detection powered by multimodal large language models

Created 2 years ago
253 stars

Top 99.4% on SourcePulse

Project Summary

ContextDET introduces contextual object detection to address a gap in the perception abilities of Multimodal Large Language Models (MLLMs). It detects visible objects within diverse human-AI interactive contexts, such as language cloze tests, visual captioning, and question answering. This benefits researchers and developers who want to give MLLMs object recognition that is not limited to fixed class labels.

How It Works

The project uses a "generate-then-detect" framework with three components: a visual encoder that extracts image representations, a pre-trained LLM that decodes multimodal contextual tokens given a task-specific prefix, and a visual decoder that predicts bounding boxes and scores for conditional queries linked to contextual object words. Because the object words come from the LLM's output, the model can detect objects named by words from the general human vocabulary, whereas traditional object detectors are limited to a fixed set of class labels.
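The data flow described above can be illustrated with a toy sketch. This is not the ContextDET code: every function here (`encode_image`, `generate_text`, `decode_boxes`, `generate_then_detect`) is a hypothetical stand-in with trivial logic, and only the order of the three stages follows the description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    word: str                                # contextual object word
    box: Tuple[float, float, float, float]   # (x0, y0, x1, y1), normalized
    score: float


def encode_image(image: List[List[float]]) -> List[float]:
    """Toy visual encoder (assumption): one mean value per image row."""
    return [sum(row) / len(row) for row in image]


def generate_text(visual_tokens: List[float], prompt: str) -> Tuple[str, List[str]]:
    """Toy LLM (assumption): fill every [MASK] with a word chosen from the image."""
    brightness = sum(visual_tokens) / len(visual_tokens)
    word = "dog" if brightness > 0.5 else "cat"
    object_words = [word] * prompt.count("[MASK]")
    return prompt.replace("[MASK]", word), object_words


def decode_boxes(visual_tokens: List[float], object_words: List[str]) -> List[Detection]:
    """Toy visual decoder (assumption): one full-image box per object word."""
    score = sum(visual_tokens) / len(visual_tokens)
    return [Detection(w, (0.0, 0.0, 1.0, 1.0), score) for w in object_words]


def generate_then_detect(image: List[List[float]], prompt: str):
    # Stage 1: image -> visual tokens
    visual_tokens = encode_image(image)
    # Stage 2: generate text first, collecting the object words it contains
    text, object_words = generate_text(visual_tokens, prompt)
    # Stage 3: detect boxes conditioned on the generated object words
    return text, decode_boxes(visual_tokens, object_words)


if __name__ == "__main__":
    text, detections = generate_then_detect([[1.0, 1.0], [0.0, 1.0]], "a [MASK] on the grass")
    print(text)
    print(detections)
```

The point of the ordering is that detection is conditioned on words the language model produced, so the label set is whatever the generator can say, not a predefined class list.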

Quick Start & Requirements

  • Installation: pip install -r requirements.txt
  • Prerequisites: Python packages (detailed in requirements.txt), a checkpoint file (download required).
  • Execution: Run python app.py after setup.
  • Demo: Available on HuggingFace Spaces.
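Since the app depends on a separately downloaded checkpoint, a small pre-flight check can catch a missing file before launching. The sketch below is an illustration only: the helper `missing_prerequisites` is hypothetical, and the checkpoint filename is a placeholder, because the summary does not state the actual name or location.

```python
from pathlib import Path
from typing import List


def missing_prerequisites(repo_dir: Path, checkpoint_name: str) -> List[str]:
    """Return the names of expected files that are absent from repo_dir.

    checkpoint_name is a placeholder supplied by the caller; the real
    checkpoint filename is not given in this summary.
    """
    expected = ["requirements.txt", "app.py", checkpoint_name]
    return [name for name in expected if not (repo_dir / name).is_file()]


if __name__ == "__main__":
    # "checkpoint.pth" is an assumed placeholder name, not the project's file.
    missing = missing_prerequisites(Path("."), "checkpoint.pth")
    if missing:
        print("Missing before running app.py:", ", ".join(missing))
    else:
        print("All expected files found; run: python app.py")
```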

Highlighted Details

  • Defines and addresses the novel research problem of "contextual object detection."
  • Enables detection of objects using words from the general human vocabulary, expanding beyond pre-defined class sets.
  • Provides a HuggingFace demo and model checkpoint for immediate experimentation.
  • Includes the CODE dataset for evaluation purposes.

Maintenance & Community

The project acknowledges several public codebases it builds on, including DETR, Deformable DETR, DETA, OV-DETR, and BLIP-2. The README provides no community channels (e.g., Discord, Slack) or roadmap links.

Licensing & Compatibility

Licensed under the S-Lab License 1.0, which permits redistribution and use for non-commercial purposes only.

Limitations & Caveats

Training scripts are not yet available; they are noted as "waiting to be cleaned up." The non-commercial terms of the S-Lab License 1.0 are a significant adoption blocker for commercial products.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 4 stars in the last 30 days
