VinVL by pzzhang

Research paper for improved visual representations in vision-language models

Created 4 years ago
358 stars

Top 78.0% on SourcePulse

Project Summary

VinVL improves visual representations for vision-language (VL) tasks by introducing a new object detection model. Pre-trained on large annotated datasets, the model generates richer object-centric features that significantly boost performance across a range of VL benchmarks. The project is aimed at researchers and practitioners working on VL models.

How It Works

VinVL replaces traditional bottom-up/top-down visual feature extractors with a custom-designed, larger object detection model. This model is pre-trained on a combined corpus of multiple annotated object detection datasets, enabling it to capture a wider array of visual objects and concepts. By feeding these enhanced features into a Transformer-based VL fusion model like OSCAR, VinVL demonstrates substantial performance gains across multiple VL tasks.
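To make the data flow concrete, here is a minimal Python sketch of how detector outputs might be packed into the inputs of a fusion Transformer. It is an illustration under stated assumptions, not the repository's code: the `Region` container, the `box_geometry` and `build_fusion_input` helpers, the six-value box encoding, and the use of object tags as a separate input segment are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Region:
    """One detected region. Hypothetical container, not the repository's format."""
    feature: List[float]  # pooled detector feature for the region
    box: List[float]      # [x1, y1, x2, y2] in pixels
    tag: str              # predicted object label, e.g. "dog"


def box_geometry(box: List[float], width: float, height: float) -> List[float]:
    """Normalize box corners by image size and append relative width and height."""
    x1, y1, x2, y2 = box
    return [
        x1 / width, y1 / height, x2 / width, y2 / height,
        (x2 - x1) / width, (y2 - y1) / height,
    ]


def build_fusion_input(
    caption_tokens: List[str], regions: List[Region], width: float, height: float
) -> Dict[str, list]:
    """Assemble text, tag, and visual segments (an assumed input layout)."""
    tags = [r.tag for r in regions]
    # Each visual vector is the region feature followed by its box geometry.
    visual = [r.feature + box_geometry(r.box, width, height) for r in regions]
    return {"text": list(caption_tokens), "tags": tags, "visual": visual}


if __name__ == "__main__":
    # Toy 4-dimensional features; real detector features are much larger.
    regions = [
        Region(feature=[0.1, 0.2, 0.3, 0.4], box=[0, 0, 50, 100], tag="dog"),
        Region(feature=[0.5, 0.6, 0.7, 0.8], box=[50, 20, 100, 80], tag="ball"),
    ]
    inputs = build_fusion_input(
        ["a", "dog", "chasing", "a", "ball"], regions, width=100, height=100
    )
    print(inputs["tags"])
    print(len(inputs["visual"][0]))
```

In this layout, swapping the detector changes only the `feature` vectors and `tag` strings; the fusion model's interface is unchanged, which matches the summary's claim that the gains come from replacing the visual features.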

Quick Start & Requirements

Highlighted Details

  • Achieved state-of-the-art results on seven public VL benchmarks, including VQA, Image Captioning, and NLVR2.
  • Replacing only the visual features yields significant gains (up to 5.9% on NoCaps), with 95% of the improvement attributed to the visual representation.
  • Offers pretrained Faster R-CNN object-attribute detection models and feature extraction tools.
  • Provides pretrained OSCAR+ models and code for VL pretraining and downstream task fine-tuning.

Maintenance & Community

The project is associated with authors from Microsoft and is part of research efforts that have produced related works like OSCAR. Citations are provided for VinVL and OSCAR.

Licensing & Compatibility

The README does not explicitly state a license. The related OSCAR project is described as released under a permissive license such as MIT, but that does not by itself cover this repository. Confirm the license explicitly before any commercial use.

Limitations & Caveats

The project relies on a custom object detection model and requires specific configurations for feature extraction. The primary focus is on improving visual features, with downstream VL model integration handled by the OSCAR repository. The README does not specify Python version requirements or detailed installation instructions beyond command-line examples.

Health Check

Last Commit: 2 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 1 star in the last 30 days
