Vision-language pre-training research paper
This repository provides code and pre-trained models for Oscar and VinVL, two pre-training methods for vision-language tasks. It targets researchers and practitioners in NLP and computer vision, and the authors report state-of-the-art results on tasks such as image captioning and visual question answering.
How It Works
Oscar uses object tags detected in images as anchor points to ease image-text alignment during pre-training. Because the salient objects in an image are often also mentioned in the paired text, the tags give the model an explicit link between the two modalities. VinVL, a follow-up to Oscar, revisits the visual representation and provides improved object-attribute detection, which in turn improves performance on vision-language tasks.
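As a rough illustration of the tag-as-anchor idea described above, the sketch below assembles a caption, a list of detected object tags, and per-region feature vectors into a single model input, then lists the positions where a tag matches a caption word. It is not the repository's code: `OscarStyleInput`, `build_input`, and `anchor_positions` are hypothetical names, whitespace splitting stands in for a real tokenizer, and the two-dimensional feature vectors are placeholders for real detector outputs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OscarStyleInput:
    """Hypothetical container for one (caption, object tags, regions) example."""
    tokens: List[str]                   # caption words, then object tags
    segment_ids: List[int]              # 0 = caption segment, 1 = tag segment
    region_features: List[List[float]]  # one feature vector per detected region


def build_input(caption: str,
                object_tags: List[str],
                region_features: List[List[float]]) -> OscarStyleInput:
    # Assumption: whitespace splitting replaces a real subword tokenizer.
    tokens = ["[CLS]"] + caption.lower().split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)

    # Object tags are plain words, so they share the text vocabulary.
    tag_tokens = [tag.lower() for tag in object_tags] + ["[SEP]"]
    tokens += tag_tokens
    segment_ids += [1] * len(tag_tokens)

    return OscarStyleInput(tokens, segment_ids, region_features)


def anchor_positions(example: OscarStyleInput) -> List[Tuple[int, int]]:
    """Return (caption_index, tag_index) pairs where a tag repeats a caption word."""
    first_sep = example.tokens.index("[SEP]")
    caption = example.tokens[1:first_sep]
    tags = example.tokens[first_sep + 1:-1]
    pairs = []
    for tag_offset, tag in enumerate(tags):
        for word_offset, word in enumerate(caption):
            if word == tag:
                pairs.append((word_offset + 1, first_sep + 1 + tag_offset))
    return pairs


example = build_input(
    "A dog sits on a couch",
    ["dog", "couch", "pillow"],
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # placeholder region features
)
print(example.tokens)
print(anchor_positions(example))  # [(2, 8), (6, 9)]
```

Here "dog" and "couch" appear in both the caption and the tag list, so they act as shared anchors between text and image regions, while "pillow" is detected in the image but not mentioned in the caption.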
Quick Start & Requirements
Installation instructions are available in INSTALL.md. Pre-trained models, datasets, and VinVL image features can be found in VinVL_DOWNLOAD.md and DOWNLOAD.md. Scripts for downstream finetuning are in MODEL_ZOO.md and VinVL_MODEL_ZOO.md.
Maintenance & Community
The project is associated with Microsoft Research. The update notes in the README also mention later work on visual instruction tuning with GPT-4 (LLaVA).
Licensing & Compatibility
Oscar is released under the MIT license, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The README announces the release of Oscar+ pretraining code and VinVL features, but it does not spell out dependencies or setup complexity beyond the general installation instructions.