VinVL by pzzhang

Research paper for improved visual representations in vision-language models

created 4 years ago
356 stars

Top 79.5% on sourcepulse

Project Summary

VinVL improves visual representations for vision-language (VL) tasks by introducing a new object detection model. Pre-trained on extensive datasets, the model generates richer object-centric features that significantly boost performance across various VL benchmarks. The project is aimed at researchers and practitioners working on VL models.

How It Works

VinVL replaces traditional bottom-up/top-down visual feature extractors with a custom-designed, larger object detection model. This model is pre-trained on a combined corpus of multiple annotated object detection datasets, enabling it to capture a wider array of visual objects and concepts. By feeding these enhanced features into a Transformer-based VL fusion model like OSCAR, VinVL demonstrates substantial performance gains across multiple VL tasks.
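The hand-off from detector to fusion model can be illustrated with a minimal sketch. It assumes the OSCAR-style input of caption words, detected object tags, and region features. The `Region` class, the function names, the toy 3-value feature vectors, and the choice of six normalized box values appended to each feature are all hypothetical, chosen for illustration rather than taken from the VinVL code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Region:
    """One detected region (hypothetical container, not a VinVL type)."""
    tag: str                  # predicted object name, e.g. "dog"
    box: Sequence[float]      # (x1, y1, x2, y2) in pixels
    feature: Sequence[float]  # region feature from the detector (toy size here)


def with_box_geometry(region: Region, img_w: float, img_h: float) -> List[float]:
    """Append six normalized box values to the region feature.

    Assumption: corners plus width and height, each scaled by image size.
    """
    x1, y1, x2, y2 = region.box
    geometry = [
        x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
        (x2 - x1) / img_w, (y2 - y1) / img_h,
    ]
    return list(region.feature) + geometry


def build_fusion_input(caption: str, regions: List[Region],
                       img_w: float, img_h: float) -> dict:
    """Assemble the (words, tags, region features) triple for a fusion model."""
    return {
        "words": caption.lower().split(),  # stand-in for a real tokenizer
        "tags": [r.tag for r in regions],
        "region_features": [with_box_geometry(r, img_w, img_h) for r in regions],
    }


regions = [
    Region("dog", (10, 20, 110, 220), [0.5, 0.1, 0.9]),
    Region("ball", (150, 180, 190, 220), [0.2, 0.7, 0.3]),
]
example = build_fusion_input("A dog chases a ball", regions, img_w=200, img_h=400)
```

In this sketch, swapping in a stronger detector changes only the `tag` and `feature` values of each region. The fusion model's input format stays the same, which is how replacing the visual features alone can be evaluated.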

Quick Start & Requirements

Highlighted Details

  • Achieved state-of-the-art results on seven public VL benchmarks, including VQA, Image Captioning, and NLVR2.
  • Shows significant performance gains (up to 5.9% on NoCaps) from simply replacing the visual features, with 95% of the improvement attributed to the visual representation.
  • Offers pretrained Faster R-CNN object-attribute detection models and feature extraction tools.
  • Provides pretrained OSCAR+ models and code for VL pretraining and downstream task fine-tuning.

Maintenance & Community

The project is associated with authors from Microsoft and is part of research efforts that have produced related works like OSCAR. Citations are provided for VinVL and OSCAR.

Licensing & Compatibility

The README does not explicitly state a license. The related OSCAR project appears to use a permissive license such as MIT, but that does not extend to VinVL by association. Confirm the license explicitly before any commercial use.

Limitations & Caveats

The project relies on a custom object detection model and requires specific configurations for feature extraction. The primary focus is on improving visual features, with downstream VL model integration handled by the OSCAR repository. The README does not specify Python version requirements or detailed installation instructions beyond command-line examples.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 90 days
