OV-DINO by wanghao9610

Research paper for open-vocabulary object detection

Created 1 year ago
367 stars

Top 76.8% on SourcePulse

View on GitHub
Project Summary

OV-DINO provides a unified approach to open-vocabulary object detection, addressing the need for flexible and accurate detection across a wide range of categories. It is designed for researchers and practitioners in computer vision and deep learning who require state-of-the-art performance in zero-shot and fine-tuned detection tasks. The project offers significant improvements over previous methods, particularly in zero-shot evaluation on challenging benchmarks like COCO and LVIS.

How It Works

OV-DINO employs a Unified Data Integration pipeline for end-to-end pre-training on diverse datasets, including Objects365, GoldG, and CC1M. A key innovation is the Language-Aware Selective Fusion module, which enhances the model's vision-language understanding by selectively fusing information based on linguistic context. This approach leads to improved zero-shot capabilities and overall detection accuracy.
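
As a rough illustration of the idea, the sketch below shows one plausible shape a language-aware selective fusion step could take: visual queries cross-attend to text embeddings, and a learned gate decides how much of that linguistic context to mix back in. The module name, tensor shapes, and sigmoid gate are illustrative assumptions, not OV-DINO's actual implementation.

```python
# Hypothetical sketch of language-aware selective fusion (not OV-DINO's code).
import torch
import torch.nn as nn

class SelectiveFusion(nn.Module):
    """Fuses text context into visual queries through a learned gate."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, dim) detection queries; txt: (B, T, dim) text embeddings.
        # Cross-attend the visual queries to the text to gather linguistic context.
        ctx, _ = self.cross_attn(query=vis, key=txt, value=txt)
        # The gate selects, per channel, how much language context to inject.
        return vis + self.gate(ctx) * ctx

# Toy usage with random tensors.
fusion = SelectiveFusion(dim=256)
vis = torch.randn(2, 900, 256)  # e.g. 900 object queries
txt = torch.randn(2, 16, 256)   # e.g. 16 category-name tokens
fused = fusion(vis, txt)        # (2, 900, 256)
```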

Quick Start & Requirements

  • Installation: Clone the repository, set CUDA_HOME if you are not using CUDA 11.6, create a conda environment (ovdino), install PyTorch 1.13.1 with CUDA 11.6, then install the project dependencies with pip install -e detectron2-717ab9 and pip install -e ./ (an illustrative command sequence follows this list). An optional environment (ovsam) is provided for OV-SAM integration.
  • Data: Requires downloading and organizing COCO, LVIS, and Objects365 datasets. Symbolic links are used to manage data paths.
  • Pre-trained Models: Download from the Model Zoo and place in inits/ovdino.
  • Resources: Evaluation on LVIS Val requires approximately 250GB of memory. Pre-training on Objects365 is demonstrated on 2 nodes with 8 A100 GPUs each.
  • Links: Paper, HuggingFace, Demo.
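
Assembled from the bullets above, a minimal setup sequence might look like the following; the repository URL, CUDA install path, Python version, and dataset directory layout are assumptions to check against the project README.

```bash
# Illustrative install sequence assembled from the summary above.
git clone https://github.com/wanghao9610/OV-DINO.git   # repo URL is an assumption
cd OV-DINO
export CUDA_HOME=/usr/local/cuda-11.6   # only if CUDA 11.6 is not your default
conda create -n ovdino python=3.9 -y    # Python version is an assumption
conda activate ovdino
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 \
    --extra-index-url https://download.pytorch.org/whl/cu116
pip install -e detectron2-717ab9        # bundled Detectron2 snapshot, per the summary
pip install -e ./
# Datasets are wired up via symbolic links; the target layout is an assumption.
ln -s /path/to/coco datas/coco
```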

Highlighted Details

  • Achieves state-of-the-art zero-shot performance, with relative improvements of +2.5% AP on COCO and +12.7% AP on LVIS compared to G-DINO.
  • Offers fine-tuning code for custom datasets and pre-training code for the O365 dataset.
  • Includes local inference and web inference demos for easy deployment and testing.
  • Integrates with SAM2 for enhanced segmentation capabilities (OV-SAM).

Maintenance & Community

The project has been actively updated, with recent releases adding the O365 pre-training code and the OV-SAM integration. The authors have been responsive to issues raised about fine-tuning.

Licensing & Compatibility

The repository does not explicitly state a license in the README. However, it references other open-source projects like Detectron2, detrex, GLIP, G-DINO, and YOLO-World, suggesting a permissive open-source orientation. Compatibility for commercial use would require explicit license confirmation.

Limitations & Caveats

The project is still under active development, with several features planned, including ONNX exporting and integration into 🤗 Transformers. The pre-training code for all datasets is noted as "Coming soon." The README mentions that uploaded images for the web demo are stored for failure analysis.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

  • Top 0.1% on SourcePulse · 4k stars
  • Open-source framework for training large multimodal models
  • Created 2 years ago · Updated 1 year ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Luis Capelo (Cofounder of Lightning AI).

GroundingDINO by IDEA-Research

  • Top 0.5% on SourcePulse · 9k stars
  • Research paper for object detection via grounded pre-training
  • Created 2 years ago · Updated 1 year ago