object-centric-ovd by hanoonaR

Object detection research paper for open-vocabulary scenarios

Created 3 years ago

297 stars

Top 89.4% on SourcePulse

Project Summary

This repository provides the official implementation for "Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection," a NeurIPS 2022 paper. It addresses limitations in current open-vocabulary detection methods by aligning object-centric language embeddings and improving generalization to novel classes using image-level supervision. The target audience is researchers and practitioners in computer vision focused on object detection and open-vocabulary tasks.

How It Works

The approach bridges the gap between image-level and object-level representations for open-vocabulary detection. Region-based Knowledge Distillation (RKD) adapts the image-centric language embeddings from CLIP to be object-centric, which improves localization. Pseudo Image-level Supervision (PIS) uses weak image-level supervision from a multi-modal Vision Transformer (MAVL) to improve generalization to novel classes through pseudo-labeling. A Weight Transfer function then combines the two components so that their complementary strengths add up.
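To make the idea of object-centric embeddings concrete, here is a minimal sketch in plain Python. It is not the repository's code. The function names, the L1 form of the distillation loss, and the toy three-dimensional vectors are all assumptions chosen for illustration. It shows two things: a loss that pulls a detector's region embeddings toward a teacher's embeddings of the same cropped regions, and classification of a region by cosine similarity against class-name text embeddings.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def region_distillation_loss(student: Sequence[Vector],
                             teacher: Sequence[Vector]) -> float:
    """Mean absolute difference between normalized student and teacher embeddings.

    Assumption: an L1 loss over unit-normalized vectors; the actual loss used
    by the repository may differ.
    """
    if not student or len(student) != len(teacher):
        raise ValueError("student and teacher must be non-empty and aligned")
    total, count = 0.0, 0
    for s, t in zip(student, teacher):
        s_n, t_n = l2_normalize(s), l2_normalize(t)
        total += sum(abs(x - y) for x, y in zip(s_n, t_n))
        count += len(s_n)
    return total / count


def classify_region(region_emb: Vector, text_embs: Dict[str, Vector]) -> str:
    """Return the class name whose text embedding is most similar to the region."""
    return max(text_embs,
               key=lambda name: cosine_similarity(region_emb, text_embs[name]))


if __name__ == "__main__":
    # Hypothetical text embeddings for two class names.
    text_embs = {"cat": [1.0, 0.0, 0.0], "umbrella": [0.0, 1.0, 0.0]}
    print(classify_region([0.2, 0.9, 0.1], text_embs))
    print(region_distillation_loss([[1.0, 0.0]], [[0.0, 1.0]]))
```

Because the classifier is just a set of text embeddings, a new category can be added by supplying one more entry in text_embs, which is what makes the vocabulary open.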

Quick Start & Requirements

Installation: Clone the repository and follow instructions in INSTALL.md.
Prerequisites: PyTorch 1.10.0, CUDA 11.3.
Training: Requires 8 A100 GPUs. Training times range from 4.5 hours to 2.5 days depending on the configuration.
Demo: An interactive Colab notebook is available for custom detector creation.
Documentation: Installation instructions are in INSTALL.md.
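Before following INSTALL.md, it can help to confirm that the local environment matches the listed prerequisites. The sketch below is one way to do that. The helper names parse_version and matches are hypothetical, and treating the listed versions as an exact requirement is an assumption, since INSTALL.md may permit other combinations.

```python
import re
from typing import Tuple

# Versions taken from the prerequisites listed above.
REQUIRED_TORCH = (1, 10, 0)
REQUIRED_CUDA = (11, 3)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a string such as '1.10.0+cu113' into (1, 10, 0)."""
    core = version.split("+", 1)[0]  # drop a local build suffix like '+cu113'
    numbers = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def matches(found: str, required: Tuple[int, ...]) -> bool:
    """True when the leading components of `found` equal `required`."""
    return parse_version(found)[: len(required)] == required


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        torch_ok = matches(torch.__version__, REQUIRED_TORCH)
        print("torch", torch.__version__, "ok" if torch_ok else "differs")
        cuda = torch.version.cuda  # None on CPU-only builds
        cuda_ok = cuda is not None and matches(cuda, REQUIRED_CUDA)
        print("cuda", cuda, "ok" if cuda_ok else "differs or missing")
```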

Highlighted Details

Achieves state-of-the-art results on COCO and LVIS benchmarks for open-vocabulary detection.
Demonstrates significant gains on novel classes: 40.3 AP50 on COCO (an 11.9 absolute gain) and a 5.0 mask AP gain for rare categories on LVIS.
Ablation studies show that the proposed Weight Transfer method provides complementary gains over naively adding the RKD and PIS components.
Code is based on the Detic repository and utilizes the MViT model (MAVL).
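For readers unfamiliar with the AP50 metric quoted above: a predicted box counts as a correct detection at this threshold only when its intersection-over-union (IoU) with a ground-truth box of the same class is at least 0.5. The sketch below shows only that overlap test for axis-aligned boxes. The function names and the (x1, y1, x2, y2) box layout are assumptions for illustration, and the full metric also averages precision over recall levels and classes. Mask AP applies the same idea to segmentation masks rather than boxes.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1, y2 >= y1


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred: Box, gt: Box, threshold: float = 0.5) -> bool:
    """True when the predicted box overlaps the ground truth enough for AP50."""
    return iou(pred, gt) >= threshold


if __name__ == "__main__":
    gt = (0.0, 0.0, 10.0, 10.0)
    print(is_match_at_50((0.0, 0.0, 10.0, 8.0), gt))   # IoU 0.8
    print(is_match_at_50((5.0, 0.0, 15.0, 10.0), gt))  # IoU 1/3
```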

Maintenance & Community

The paper was accepted at NeurIPS 2022.
Contact information for questions is provided via email. Issues can be raised on the repository.

Licensing & Compatibility

The repository does not explicitly state a license in the README.
Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Training is resource-intensive, requiring multiple high-end GPUs (8xA100).
The README does not specify the license, which could impact commercial adoption.

Health Check

Last Commit: 3 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

1 star in the last 30 days
