mvits_for_class_agnostic_od  by mmaaz60

Official code for a research paper on class-agnostic object detection

created 3 years ago
312 stars


Project Summary

This repository is the official implementation of "Class-agnostic Object Detection with Multi-modal Transformer" (ECCV 2022). Traditional object detectors scale poorly to new domains and novel objects; the paper addresses this with multi-modal Vision Transformers (MViTs) trained on aligned image-text pairs. The intended audience is computer vision researchers and practitioners, particularly those working on open-world object detection, salient object detection, and self-supervised detection. The main benefit is state-of-the-art localization of generic objects, including objects unseen during training, with the added ability to steer detections through language queries.

How It Works

The project builds on Multi-modal Vision Transformers (MViTs) and proposes a new architecture, Multiscale Attention ViT with Late fusion (MAVL). MAVL combines multi-scale feature processing with late vision-language fusion, whereas standard MViTs often lack multi-scale processing and need longer training. Its multi-scale deformable attention lets it capture richer object representations. Because the MViTs are trained on aligned image-text pairs, they learn to bridge the gap between object-level and image-level representations, which is what enables class-agnostic detection.
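
To make "class-agnostic" concrete, the sketch below scores a set of box proposals purely by overlap with ground-truth boxes, never consulting a category label. It is an illustration only, not code from the repository: the function names `iou` and `class_agnostic_recall`, the `(x1, y1, x2, y2)` box layout, and the 0.5 IoU threshold are all assumptions chosen for the example.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_recall(
    proposals: List[Box], gt_boxes: List[Box], iou_thresh: float = 0.5
) -> float:
    """Fraction of ground-truth boxes covered by at least one proposal.

    No class label is used anywhere: a box counts as found if any
    proposal overlaps it with IoU >= iou_thresh.
    """
    if not gt_boxes:
        return 0.0
    hits = sum(
        1 for g in gt_boxes if any(iou(p, g) >= iou_thresh for p in proposals)
    )
    return hits / len(gt_boxes)
```

With two ground-truth boxes of which only one is overlapped by a proposal, the function returns 0.5 regardless of what categories the boxes belong to.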

Quick Start & Requirements

  • Installation:
    • Install PyTorch 1.8.0 with CUDA 11.1:
      pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
      
    • Install other dependencies:
      pip install -r requirements.txt
      
    • Compile Deformable Attention modules:
      cd models/ops
      sh make.sh
      
  • Prerequisites: PyTorch 1.8.0, torchvision 0.9.0, CUDA 11.1.
  • Resources: Pre-trained models for MAVL, Def-DETR, MDETR, DETReg, Faster-RCNN, RetinaNet, ORE, and others are available. Instructions to reproduce results are provided.
  • Links: Paper, Training, Applications, Evaluation.
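
Since the install commands pin exact versions, a quick check before compiling the deformable attention ops can save time. The sketch below compares version strings against the pins listed above. The helper names `base_version` and `check_environment` and the `REQUIRED` table are hypothetical and not part of the repository.

```python
import re
from typing import Dict, Optional, Tuple

# Pins copied from the install commands above.
REQUIRED = {"torch": "1.8.0", "torchvision": "0.9.0", "cuda": "11.1"}


def base_version(version: str) -> Tuple[int, ...]:
    """Drop a local build tag such as '+cu111' and return the numeric parts."""
    release = version.split("+", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", release))


def check_environment(found: Dict[str, Optional[str]]) -> Dict[str, bool]:
    """Map each pinned component to True when the found version matches the pin.

    A missing or None entry (for example, a CPU-only build with no CUDA
    version) is reported as False.
    """
    result = {}
    for name, pinned in REQUIRED.items():
        have = found.get(name)
        result[name] = have is not None and base_version(have) == base_version(pinned)
    return result
```

In an environment where PyTorch is already installed, the input can be built from `torch.__version__`, `torchvision.__version__`, and `torch.version.cuda`, for example `check_environment({"torch": torch.__version__, "torchvision": torchvision.__version__, "cuda": torch.version.cuda})`.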

Highlighted Details

  • Demonstrates state-of-the-art class-agnostic object detection performance across various datasets and out-of-domain scenarios.
  • Shows consistent generalization to new domains and rare/novel classes, even with limited or no prior exposure.
  • Supports interactive use: proposals adapt to specific language queries.
  • Explores the importance of language structure in object detection through experimental analysis.
  • Enables open-world object detection by using MAVL proposals for pseudo-labeling.
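
The pseudo-labeling idea in the last point can be sketched as follows. This is one common scheme in open-world detection, assumed here for illustration: keep the highest-scoring proposals that do not overlap any annotated known-class box and treat them as "unknown" objects. The function name `pseudo_label_unknowns`, the `top_k` and `max_iou` parameters, and their defaults are hypothetical; the repository's actual procedure may differ.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pseudo_label_unknowns(
    proposals: List[Tuple[Box, float]],
    known_boxes: List[Box],
    top_k: int = 5,
    max_iou: float = 0.5,
) -> List[Box]:
    """Return up to top_k proposal boxes to pseudo-label as 'unknown'.

    proposals holds (box, objectness score) pairs. A proposal is kept only
    if its IoU with every known-class ground-truth box is below max_iou.
    """
    ranked = sorted(proposals, key=lambda item: item[1], reverse=True)
    unknown = [
        box for box, _ in ranked
        if all(iou(box, known) < max_iou for known in known_boxes)
    ]
    return unknown[:top_k]
```

The threshold controls how aggressively proposals near annotated objects are discarded; a lower `max_iou` yields fewer but cleaner pseudo-labels.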

Maintenance & Community

The project is associated with authors from MBZUAI. Contact emails are provided for inquiries. Related works are also linked.

Licensing & Compatibility

The README does not state a license. The licensing terms should be confirmed before any commercial use or closed-source linking.

Limitations & Caveats

The installation pins older versions of PyTorch (1.8.0) and CUDA (11.1), which may be incompatible with newer systems. The unstated license is a further obstacle to adoption.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 90 days
