Research paper for improved visual representations in vision-language models
VinVL improves visual representations for vision-language (VL) tasks by introducing a new, larger object detection model. Pre-trained on extensive datasets, this model generates richer object-centric features that significantly boost performance across a range of VL benchmarks. The project targets researchers and practitioners in the field.
How It Works
VinVL replaces traditional bottom-up/top-down visual feature extractors with a custom-designed, larger object detection model. This model is pre-trained on a combined corpus of multiple annotated object detection datasets, enabling it to capture a wider array of visual objects and concepts. By feeding these enhanced features into a Transformer-based VL fusion model like OSCAR, VinVL demonstrates substantial performance gains across multiple VL tasks.
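To make the pipeline concrete, here is a minimal PyTorch sketch of the fusion step, assuming 2048-d detector region features and a 768-d Transformer hidden size. All class names are hypothetical stand-ins for the actual OSCAR fusion model, not the repo's API.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: module names are hypothetical,
# not the actual VinVL/OSCAR API.

class RegionFeatureProjector(nn.Module):
    """Projects detector region features (e.g., 2048-d) into the
    fusion Transformer's hidden size."""
    def __init__(self, region_dim=2048, hidden_dim=768):
        super().__init__()
        self.proj = nn.Linear(region_dim, hidden_dim)

    def forward(self, region_feats):            # (batch, regions, region_dim)
        return self.proj(region_feats)          # (batch, regions, hidden_dim)

class ToyFusionEncoder(nn.Module):
    """Stand-in for an OSCAR-style Transformer fusion model: text tokens
    and visual regions are concatenated into one sequence so that
    self-attention can fuse the two modalities."""
    def __init__(self, hidden_dim=768, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, text_embeds, region_embeds):
        fused = torch.cat([text_embeds, region_embeds], dim=1)
        return self.encoder(fused)

# Toy inputs: 16 text tokens and 36 detected regions per image.
text_embeds = torch.randn(2, 16, 768)
region_feats = torch.randn(2, 36, 2048)  # richer VinVL-style detector output

fusion = ToyFusionEncoder()
out = fusion(text_embeds, RegionFeatureProjector()(region_feats))
print(out.shape)  # torch.Size([2, 52, 768])
```

The design point VinVL makes is that the quality of `region_feats` matters as much as the fusion model: a stronger detector trained on more object and attribute categories yields better downstream VL performance without changing the fusion architecture.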
Quick Start & Requirements
Setup requires the project's custom object detection codebase (built on maskrcnn-benchmark), and potentially CUDA for GPU acceleration.
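For orientation, a hedged sketch of consuming pre-extracted features: VinVL-style features are commonly distributed as TSV rows whose feature column is a base64-encoded float32 blob. The column layout and the 2054-dim assumption (2048 visual dims plus 6 box-geometry values) are assumptions here and should be verified against the repo's feature-extraction docs.

```python
import base64
import numpy as np

def decode_region_features(b64_blob, num_regions, feat_dim=2054):
    """Decode a base64-encoded float32 blob into (num_regions, feat_dim).

    Assumption: each region feature is 2048 visual dims plus 6
    box-geometry values (2054 total), as in OSCAR-style feature files.
    Verify against the repo's actual feature format before relying on this.
    """
    buf = base64.b64decode(b64_blob)
    return np.frombuffer(buf, dtype=np.float32).reshape(num_regions, feat_dim)

# Hypothetical usage with a TSV row of the form: image_id \t num_boxes \t features
row = "img_001\t36\t" + base64.b64encode(
    np.zeros((36, 2054), dtype=np.float32).tobytes()).decode("ascii")
image_id, num_boxes, blob = row.split("\t")
feats = decode_region_features(blob, int(num_boxes))
print(image_id, feats.shape)  # img_001 (36, 2054)
```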
Highlighted Details
Maintenance & Community
The project is associated with authors from Microsoft and belongs to a research line that also produced OSCAR; citations are provided for both VinVL and OSCAR. The repository was last updated roughly two years ago and is now inactive.
Licensing & Compatibility
The README does not explicitly state a license. However, the project is associated with OSCAR, which is typically released under a permissive license like MIT. Compatibility for commercial use would require explicit license confirmation.
Limitations & Caveats
The project relies on a custom object detection model and requires specific configurations for feature extraction. The primary focus is on improving visual features, with downstream VL model integration handled by the OSCAR repository. The README does not specify Python version requirements or detailed installation instructions beyond command-line examples.