VLP by LuoweiZhou

Vision-language pre-training research paper for image captioning and VQA

Created 6 years ago · 424 stars · Top 69.5% on SourcePulse

Project Summary

This repository provides code for Unified Vision-Language Pre-training (VLP), a framework for joint pre-training on image captioning and visual question answering (VQA) tasks. It offers pre-trained models and fine-tuning scripts for datasets like COCO Captions and VQA 2.0, targeting researchers and practitioners in multimodal AI.

How It Works

VLP leverages a unified Transformer architecture, inspired by UniLM, to handle both understanding (VQA) and sequence-to-sequence generation (captioning) tasks within a single model. The shared Transformer is trained with bidirectional self-attention masks for understanding image-text relationships and with unidirectional (seq2seq) masks for generation, enabling flexible pre-training and fine-tuning strategies. The approach uses region features extracted with Detectron for richer visual representations.
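The two modes differ only in the self-attention mask applied to the shared model. Below is a minimal PyTorch sketch of that idea, not the repository's code; the function name, sequence layout, and mask convention are illustrative assumptions:

    import torch

    def build_attention_mask(num_regions, num_tokens, seq2seq):
        # 1 = attention allowed, 0 = blocked; rows attend to columns.
        # Input layout assumed: image region features followed by text tokens.
        length = num_regions + num_tokens
        if not seq2seq:
            # Bidirectional mask (understanding objectives such as VQA):
            # every position may attend to every other position.
            return torch.ones(length, length)
        # Seq2seq mask (generation objectives such as captioning):
        mask = torch.zeros(length, length)
        mask[:, :num_regions] = 1  # all positions may attend to the image regions
        causal = torch.tril(torch.ones(num_tokens, num_tokens))
        mask[num_regions:, num_regions:] = causal  # tokens see only earlier/current tokens
        return mask

    # Example: 3 region features followed by 4 caption tokens.
    print(build_attention_mask(3, 4, seq2seq=True))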

Quick Start & Requirements

  • Installation: the recommended path is creating the Conda environment from misc/vlp.yml and running ./setup.sh; a Docker image (luzhou/vlp) is available as an alternative.
  • Prerequisites: CUDA (e.g., 10.0), cuDNN (e.g., v7.5), and Miniconda, plus large datasets (COCO: 95GB+, Flickr30k: 27GB+, Conceptual Captions: 6GB+, region features: 509GB+).
  • Setup Time: Significant due to large data downloads and feature extraction.
  • Links: Pre-trained models, Fine-tuning checkpoints
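As a small illustration of working with a downloaded checkpoint (the file name and dict layout here are assumptions, not the repository's documented interface), a PyTorch checkpoint can be inspected before fine-tuning:

    import torch

    # Hypothetical local path; replace with the actual file from the
    # "Pre-trained models" or "Fine-tuning checkpoints" links.
    ckpt_path = "model.bin"
    state = torch.load(ckpt_path, map_location="cpu")

    # Checkpoints saved with torch.save() are typically (possibly nested) dicts;
    # printing a few keys and tensor shapes is a quick sanity check.
    if isinstance(state, dict):
        for name, value in list(state.items())[:10]:
            shape = tuple(value.shape) if torch.is_tensor(value) else type(value).__name__
            print(name, shape)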

Highlighted Details

  • Reports state-of-the-art results (at the time of publication) on COCO Captions (BLEU@4: 39.5, CIDEr: 129.3 with SCST) and VQA 2.0 (70.7 overall accuracy).
  • Supports both single-GPU and distributed training (up to 8x V100 GPUs); see the sketch after this list.
  • Includes scripts for pre-training on Conceptual Captions and fine-tuning on COCO, Flickr30k, and VQA 2.0.
  • Provides Detectron-based feature extraction code.
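For the distributed-training item above, the following is a generic PyTorch DistributedDataParallel skeleton, not the repository's actual training script; the placeholder model and environment-variable handling are assumptions:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # A launcher such as torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(local_rank)

        # Placeholder module standing in for the VLP model.
        model = torch.nn.Linear(2048, 768).cuda(local_rank)
        model = DDP(model, device_ids=[local_rank])

        # ... build a DataLoader with a DistributedSampler and run the training loop ...

    if __name__ == "__main__":
        main()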

Maintenance & Community

  • Based on UniLM, pytorch-transformers v0.4.0, and ImageCaptioning.pytorch.
  • No explicit community links (Discord/Slack) or roadmap mentioned.

Licensing & Compatibility

  • License: not named in the README, which points to "the license found in the LICENSE file".
  • Compatibility: the code builds on UniLM, pytorch-transformers, and ImageCaptioning.pytorch, so their licenses may also apply; commercial use is not explicitly addressed.

Limitations & Caveats

  • Requires substantial disk space and GPU resources for data and training.
  • Data preparation is complex, involving downloading and uncompressing multiple large files.
  • The README mentions potential data-loading bottlenecks with num_workers=0 in the DataLoader and recommends single-GPU inference (see the sketch below).
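As a generic illustration of that setting, not the repository's dataset class, the relevant knob is the num_workers argument of PyTorch's DataLoader:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder dataset standing in for precomputed region features.
    dataset = TensorDataset(torch.randn(256, 2048))

    # num_workers=0 loads every batch in the main process, which can bottleneck
    # training or inference when per-sample I/O is slow.
    single_process_loader = DataLoader(dataset, batch_size=32, num_workers=0)

    # Worker processes overlap data loading with GPU compute.
    multi_worker_loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)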

Health Check

  • Last Commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI
  • 353 stars · 0.3%
  • Vision-language research paper using LLMs
  • Created 2 years ago · Updated 1 month ago

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830
  • 373 stars · 0%
  • Multimodal framework for vision-and-language transformer research
  • Created 3 years ago · Updated 2 years ago

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu
  • 482 stars · 0%
  • Multimodal model for grounding language models to images
  • Created 2 years ago · Updated 1 year ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady
  • 1k stars · 0.1%
  • Image captioning model using CLIP embeddings as a prefix
  • Created 4 years ago · Updated 1 year ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations
  • 4k stars · 0.1%
  • Open-source framework for training large multimodal models
  • Created 2 years ago · Updated 1 year ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 10 more.

LAVIS by salesforce
  • 11k stars · 0.2%
  • Library for language-vision AI research
  • Created 3 years ago · Updated 10 months ago