VLP by LuoweiZhou

Vision-language pre-training research paper for image captioning and VQA

created 5 years ago
419 stars

Top 71.1% on sourcepulse

Project Summary

This repository provides code for Unified Vision-Language Pre-training (VLP), a framework for joint pre-training on image captioning and visual question answering (VQA) tasks. It offers pre-trained models and fine-tuning scripts for datasets like COCO Captions and VQA 2.0, targeting researchers and practitioners in multimodal AI.

How It Works

VLP uses a unified Transformer architecture, inspired by UniLM, to handle both generation (image captioning) and understanding (VQA) tasks within a single model. The same network applies bidirectional self-attention for understanding image-text relationships and causal, sequence-to-sequence self-attention for generation, enabling flexible pre-training and fine-tuning strategies. Visual input comes from Detectron-based region features, which provide richer representations than whole-image encodings.
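The single-model trick above comes down to swapping attention masks. The sketch below is an illustrative reconstruction of a UniLM-style sequence-to-sequence mask (the function name and sizes are made up for the example, not taken from the repository's code): image regions attend bidirectionally to one another, while caption tokens attend to all image regions but only causally to earlier caption tokens.

```python
def seq2seq_attention_mask(n_img: int, n_txt: int) -> list[list[int]]:
    """Build a UniLM-style attention mask (1 = position i may attend to j).

    Rows/columns 0..n_img-1 are image regions; the rest are caption tokens.
    """
    n = n_img + n_txt
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < n_img:
                mask[i][j] = 1          # every position sees the image regions
            elif i >= n_img and j <= i:
                mask[i][j] = 1          # caption tokens: causal self-attention
    return mask

mask = seq2seq_attention_mask(3, 4)     # 3 regions + 4 caption tokens
```

Replacing the causal text-to-text block with all-ones recovers a fully bidirectional mask, which is how one set of weights serves both the understanding and generation objectives.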

Quick Start & Requirements

  • Installation: Recommended: Conda environment setup (misc/vlp.yml) and running ./setup.sh. Alternative: Docker image (luzhou/vlp).
  • Prerequisites: CUDA (e.g., 10.0), CUDNN (e.g., v7.5), Miniconda. Large datasets (COCO: 95GB+, Flickr30k: 27GB+, Conceptual Captions: 6GB+, Region Features: 509GB+).
  • Setup Time: Significant due to large data downloads and feature extraction.
  • Links: Pre-trained models, Fine-tuning checkpoints
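The Conda route described above can be sketched as follows. This is a setup fragment, not a verified script: the clone URL is inferred from the author name, and the environment name inside `misc/vlp.yml` may differ.

```shell
# Clone the repository and build the recommended Conda environment.
git clone https://github.com/LuoweiZhou/VLP.git
cd VLP
conda env create -f misc/vlp.yml   # env name assumed to be 'vlp'
conda activate vlp
./setup.sh                         # repository-provided setup script
```

Expect the data preparation (COCO, Flickr30k, Conceptual Captions, and especially the 509GB+ region features) to dominate setup time, not the environment itself.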

Highlighted Details

  • Achieves state-of-the-art results on COCO Captions (BLEU@4: 39.5, CIDEr: 129.3 with SCST) and VQA 2.0 (70.7 overall accuracy).
  • Supports both single-GPU and distributed training (up to 8x V100 GPUs).
  • Includes scripts for pre-training on Conceptual Captions and fine-tuning on COCO, Flickr30k, and VQA 2.0.
  • Provides Detectron-based feature extraction code.

Maintenance & Community

  • Based on UniLM, pytorch-transformers v0.4.0, and ImageCaptioning.pytorch.
  • No explicit community links (Discord/Slack) or roadmap mentioned.

Licensing & Compatibility

  • License: not named in the README, which points only to the repository's LICENSE file.
  • Compatibility: the code builds on other projects (UniLM, pytorch-transformers, ImageCaptioning.pytorch), so their licenses may also apply. Commercial use is not explicitly addressed.

Limitations & Caveats

  • Requires substantial disk space and GPU resources for data and training.
  • Data preparation is complex, involving downloading and uncompressing multiple large files.
  • The README notes that data loading can become a bottleneck when DataLoader runs with num_workers=0, and recommends single-GPU inference.
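To illustrate the num_workers caveat, here is a minimal PyTorch sketch. The tensor shapes are hypothetical stand-ins for Detectron-style region features (36 regions x 2048 dims), not the repository's actual loader:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for VLP's feature dataset: 100 samples of
# Detectron-style region features with shape (36, 2048).
features = torch.randn(100, 36, 2048)
dataset = TensorDataset(features)

# num_workers=0 loads batches in the main process and can starve the GPU;
# a few worker processes usually hide disk and decode latency instead.
loader = DataLoader(dataset, batch_size=16, num_workers=2, shuffle=True)

(batch,) = next(iter(loader))
print(batch.shape)  # torch.Size([16, 36, 2048])
```

With the real 509GB+ feature files, worker count (and pinned memory) matters far more than in this toy example.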
Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 90 days