Vision-language pre-training research paper for image captioning and VQA
This repository provides code for Unified Vision-Language Pre-training (VLP), a framework for joint pre-training on image captioning and visual question answering (VQA) tasks. It offers pre-trained models and fine-tuning scripts for datasets like COCO Captions and VQA 2.0, targeting researchers and practitioners in multimodal AI.
How It Works
VLP leverages a unified Transformer architecture, inspired by UniLM, to handle both generation (captioning) and understanding (VQA) tasks within a single model. It uses bidirectional self-attention for understanding image-text relationships and sequence-to-sequence (unidirectional) self-attention for generation, enabling flexible pre-training and fine-tuning strategies. The approach utilizes region features extracted with Detectron for richer visual representations.
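As a rough sketch of the masking idea (the function name and tensor layout below are illustrative assumptions, not the repository's actual code), the two objectives can be realized by swapping the self-attention mask fed to one shared Transformer:

```python
import torch

def build_attention_masks(num_regions: int, num_tokens: int):
    """Illustrative sketch: build the two self-attention masks that let one
    shared Transformer serve both understanding- and generation-style tasks.

    Entry (i, j) is 1 if position i may attend to position j.
    Image regions come first, followed by the text tokens.
    """
    total = num_regions + num_tokens

    # Bidirectional mask: every position attends to every other position,
    # as in a standard encoder (understanding tasks such as VQA).
    bidirectional = torch.ones(total, total, dtype=torch.long)

    # Seq2seq mask: image regions attend only to each other, while each
    # text token attends to all regions plus the tokens up to and
    # including itself (caption generation).
    seq2seq = torch.zeros(total, total, dtype=torch.long)
    seq2seq[:num_regions, :num_regions] = 1        # region -> region
    seq2seq[num_regions:, :num_regions] = 1        # token  -> region
    causal = torch.tril(torch.ones(num_tokens, num_tokens, dtype=torch.long))
    seq2seq[num_regions:, num_regions:] = causal   # token  -> left context

    return bidirectional, seq2seq

if __name__ == "__main__":
    bi_mask, s2s_mask = build_attention_masks(num_regions=3, num_tokens=4)
    print(s2s_mask)
```

Because only the mask changes between objectives, the same Transformer weights are reused across pre-training and both fine-tuning targets.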
Quick Start & Requirements
Setup involves creating a Conda environment from misc/vlp.yml and running ./setup.sh. Alternative: a Docker image (luzhou/vlp) is available.
Highlighted Details
Maintenance & Community
Last recorded activity was 3 years ago, and the repository is marked inactive.
Licensing & Compatibility
Limitations & Caveats
A known issue requires setting num_workers=0 in the DataLoader, and single-GPU inference is recommended.
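As a hedged illustration of that workaround (the dataset and batch size below are placeholders, not the repository's evaluation pipeline), the fix amounts to keeping data loading in the main process:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the repository's evaluation data.
eval_dataset = TensorDataset(torch.randn(8, 10))

# Workaround from the caveat above: num_workers=0 keeps data loading in the
# main process, sidestepping the multi-worker issue; run inference on one GPU.
eval_loader = DataLoader(eval_dataset, batch_size=4, shuffle=False, num_workers=0)

for (batch,) in eval_loader:
    pass  # run single-GPU inference on `batch` here
```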