OV-DINO by wanghao9610

Research paper for open-vocabulary object detection

Created 1 year ago
367 stars

Top 76.8% on SourcePulse

View on GitHub
Project Summary

OV-DINO provides a unified approach to open-vocabulary object detection, addressing the need for flexible and accurate detection across a wide range of categories. It is designed for researchers and practitioners in computer vision and deep learning who require state-of-the-art performance in zero-shot and fine-tuned detection tasks. The project offers significant improvements over previous methods, particularly in zero-shot evaluation on challenging benchmarks like COCO and LVIS.

How It Works

OV-DINO employs a Unified Data Integration pipeline for end-to-end pre-training on diverse datasets, including Objects365, GoldG, and CC1M. A key innovation is the Language-Aware Selective Fusion module, which enhances the model's vision-language understanding by selectively fusing information based on linguistic context. This approach leads to improved zero-shot capabilities and overall detection accuracy.
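
As a rough illustration of the idea, the sketch below shows one plausible shape a language-aware selective fusion step could take: visual queries cross-attend to text embeddings, and a learned gate decides how much of that linguistic context to mix back in. The module name, tensor shapes, and sigmoid gate are illustrative assumptions, not OV-DINO's actual implementation.

```python
# Hypothetical sketch of language-aware selective fusion (not OV-DINO's code).
import torch
import torch.nn as nn

class SelectiveFusion(nn.Module):
    """Fuses text context into visual queries through a learned gate."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, dim) detection queries; txt: (B, T, dim) text embeddings.
        # Cross-attend the visual queries to the text to gather linguistic context.
        ctx, _ = self.cross_attn(query=vis, key=txt, value=txt)
        # The gate selects, per channel, how much language context to inject.
        return vis + self.gate(ctx) * ctx

# Toy usage with random tensors.
fusion = SelectiveFusion(dim=256)
vis = torch.randn(2, 900, 256)  # e.g. 900 object queries
txt = torch.randn(2, 16, 256)   # e.g. 16 category-name tokens
fused = fusion(vis, txt)        # (2, 900, 256)
```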

Quick Start & Requirements

  • Installation: Clone the repository, set CUDA_HOME if you are not using CUDA 11.6, create a conda environment (ovdino), install PyTorch 1.13.1 with CUDA 11.6, then install the project dependencies with pip install -e detectron2-717ab9 and pip install -e ./ (an illustrative command sequence follows this list). An optional environment (ovsam) is provided for OV-SAM integration.
  • Data: Requires downloading and organizing COCO, LVIS, and Objects365 datasets. Symbolic links are used to manage data paths.
  • Pre-trained Models: Download from the Model Zoo and place in inits/ovdino.
  • Resources: Evaluation on LVIS Val requires approximately 250GB of memory. Pre-training on Objects365 is demonstrated on 2 nodes with 8 A100 GPUs each.
  • Links: Paper, HuggingFace, Demo.
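
Assembled from the bullets above, a minimal setup sequence might look like the following; the repository URL, CUDA install path, Python version, and dataset directory layout are assumptions to check against the project README.

```bash
# Illustrative install sequence assembled from the summary above.
git clone https://github.com/wanghao9610/OV-DINO.git   # repo URL is an assumption
cd OV-DINO
export CUDA_HOME=/usr/local/cuda-11.6   # only if CUDA 11.6 is not your default
conda create -n ovdino python=3.9 -y    # Python version is an assumption
conda activate ovdino
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 \
    --extra-index-url https://download.pytorch.org/whl/cu116
pip install -e detectron2-717ab9        # bundled Detectron2 snapshot, per the summary
pip install -e ./
# Datasets are wired up via symbolic links; the target layout is an assumption.
ln -s /path/to/coco datas/coco
```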

Highlighted Details

  • Achieves state-of-the-art zero-shot performance, with relative improvements of +2.5% AP on COCO and +12.7% AP on LVIS compared to G-DINO.
  • Offers fine-tuning code for custom datasets and pre-training code for the O365 dataset.
  • Includes local inference and web inference demos for easy deployment and testing.
  • Integrates with SAM2 for enhanced segmentation capabilities (OV-SAM).

Maintenance & Community

The project has been actively updated, with recent releases adding the O365 pre-training code and the OV-SAM integration. The authors have been responsive to issues raised about fine-tuning.

Licensing & Compatibility

The repository does not explicitly state a license in the README. However, it references other open-source projects like Detectron2, detrex, GLIP, G-DINO, and YOLO-World, suggesting a permissive open-source orientation. Compatibility for commercial use would require explicit license confirmation.

Limitations & Caveats

The project is still under active development, with several features planned, including ONNX exporting and integration into 🤗 Transformers. The pre-training code for all datasets is noted as "Coming soon." The README mentions that uploaded images for the web demo are stored for failure analysis.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

  • Top 0.1% on SourcePulse · 4k stars
  • Open-source framework for training large multimodal models
  • Created 2 years ago · Updated 1 year ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Luis Capelo (Cofounder of Lightning AI).

GroundingDINO by IDEA-Research

  • Top 0.5% on SourcePulse · 9k stars
  • Research paper for object detection via grounded pre-training
  • Created 2 years ago · Updated 1 year ago