VLP by LuoweiZhou

Vision-language pre-training research paper for image captioning and VQA

Created 6 years ago · 424 stars · Top 69.5% on SourcePulse

Project Summary

This repository provides code for Unified Vision-Language Pre-training (VLP), a framework for joint pre-training on image captioning and visual question answering (VQA) tasks. It offers pre-trained models and fine-tuning scripts for datasets like COCO Captions and VQA 2.0, targeting researchers and practitioners in multimodal AI.

How It Works

VLP leverages a unified Transformer architecture, inspired by UniLM, to handle both understanding (VQA) and sequence-to-sequence generation (captioning) tasks within a single model. The shared Transformer is trained with bidirectional self-attention masks for understanding image-text relationships and with unidirectional (seq2seq) masks for generation, enabling flexible pre-training and fine-tuning strategies. The approach uses region features extracted with Detectron for richer visual representations.
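The two modes differ only in the self-attention mask applied to the shared model. Below is a minimal PyTorch sketch of that idea, not the repository's code; the function name, sequence layout, and mask convention are illustrative assumptions:

    import torch

    def build_attention_mask(num_regions, num_tokens, seq2seq):
        # 1 = attention allowed, 0 = blocked; rows attend to columns.
        # Input layout assumed: image region features followed by text tokens.
        length = num_regions + num_tokens
        if not seq2seq:
            # Bidirectional mask (understanding objectives such as VQA):
            # every position may attend to every other position.
            return torch.ones(length, length)
        # Seq2seq mask (generation objectives such as captioning):
        mask = torch.zeros(length, length)
        mask[:, :num_regions] = 1  # all positions may attend to the image regions
        causal = torch.tril(torch.ones(num_tokens, num_tokens))
        mask[num_regions:, num_regions:] = causal  # tokens see only earlier/current tokens
        return mask

    # Example: 3 region features followed by 4 caption tokens.
    print(build_attention_mask(3, 4, seq2seq=True))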

Quick Start & Requirements

  • Installation: the recommended path is creating the Conda environment from misc/vlp.yml and running ./setup.sh; a Docker image (luzhou/vlp) is available as an alternative.
  • Prerequisites: CUDA (e.g., 10.0), cuDNN (e.g., v7.5), and Miniconda, plus large datasets (COCO: 95GB+, Flickr30k: 27GB+, Conceptual Captions: 6GB+, region features: 509GB+).
  • Setup Time: Significant due to large data downloads and feature extraction.
  • Links: Pre-trained models, Fine-tuning checkpoints
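As a small illustration of working with a downloaded checkpoint (the file name and dict layout here are assumptions, not the repository's documented interface), a PyTorch checkpoint can be inspected before fine-tuning:

    import torch

    # Hypothetical local path; replace with the actual file from the
    # "Pre-trained models" or "Fine-tuning checkpoints" links.
    ckpt_path = "model.bin"
    state = torch.load(ckpt_path, map_location="cpu")

    # Checkpoints saved with torch.save() are typically (possibly nested) dicts;
    # printing a few keys and tensor shapes is a quick sanity check.
    if isinstance(state, dict):
        for name, value in list(state.items())[:10]:
            shape = tuple(value.shape) if torch.is_tensor(value) else type(value).__name__
            print(name, shape)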

Highlighted Details

  • Reports state-of-the-art results (at the time of publication) on COCO Captions (BLEU@4: 39.5, CIDEr: 129.3 with SCST) and VQA 2.0 (70.7 overall accuracy).
  • Supports both single-GPU and distributed training (up to 8x V100 GPUs); see the sketch after this list.
  • Includes scripts for pre-training on Conceptual Captions and fine-tuning on COCO, Flickr30k, and VQA 2.0.
  • Provides Detectron-based feature extraction code.
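For the distributed-training item above, the following is a generic PyTorch DistributedDataParallel skeleton, not the repository's actual training script; the placeholder model and environment-variable handling are assumptions:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # A launcher such as torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(local_rank)

        # Placeholder module standing in for the VLP model.
        model = torch.nn.Linear(2048, 768).cuda(local_rank)
        model = DDP(model, device_ids=[local_rank])

        # ... build a DataLoader with a DistributedSampler and run the training loop ...

    if __name__ == "__main__":
        main()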

Maintenance & Community

  • Based on UniLM, pytorch-transformers v0.4.0, and ImageCaptioning.pytorch.
  • No explicit community links (Discord/Slack) or roadmap mentioned.

Licensing & Compatibility

  • License: not named in the README, which points to "the license found in the LICENSE file".
  • Compatibility: the code builds on UniLM, pytorch-transformers, and ImageCaptioning.pytorch, so their licenses may also apply; commercial use is not explicitly addressed.

Limitations & Caveats

  • Requires substantial disk space and GPU resources for data and training.
  • Data preparation is complex, involving downloading and uncompressing multiple large files.
  • The README mentions potential data-loading bottlenecks with num_workers=0 in the DataLoader and recommends single-GPU inference (see the sketch below).
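As a generic illustration of that setting, not the repository's dataset class, the relevant knob is the num_workers argument of PyTorch's DataLoader:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder dataset standing in for precomputed region features.
    dataset = TensorDataset(torch.randn(256, 2048))

    # num_workers=0 loads every batch in the main process, which can bottleneck
    # training or inference when per-sample I/O is slow.
    single_process_loader = DataLoader(dataset, batch_size=32, num_workers=0)

    # Worker processes overlap data loading with GPU compute.
    multi_worker_loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)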

Health Check

  • Last Commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI
  • 353 stars · 0.3%
  • Vision-language research paper using LLMs
  • Created 2 years ago · Updated 1 month ago

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830
  • 373 stars · 0%
  • Multimodal framework for vision-and-language transformer research
  • Created 3 years ago · Updated 2 years ago

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu
  • 482 stars · 0%
  • Multimodal model for grounding language models to images
  • Created 2 years ago · Updated 1 year ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady
  • 1k stars · 0.1%
  • Image captioning model using CLIP embeddings as a prefix
  • Created 4 years ago · Updated 1 year ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations
  • 4k stars · 0.1%
  • Open-source framework for training large multimodal models
  • Created 2 years ago · Updated 1 year ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 10 more.

LAVIS by salesforce
  • 11k stars · 0.2%
  • Library for language-vision AI research
  • Created 3 years ago · Updated 10 months ago