lxmert by airsplay

PyTorch code for cross-modality representation learning via Transformers

created 6 years ago
958 stars

Top 39.2% on sourcepulse

Project Summary

This repository provides the PyTorch implementation for LXMERT, a model designed for learning cross-modality encoder representations from Transformers. It is suitable for researchers and practitioners working on vision-and-language tasks such as Visual Question Answering (VQA), GQA, and NLVR2, offering pre-trained models and fine-tuning scripts to achieve state-of-the-art results.

How It Works

LXMERT uses a Transformer-based architecture with separate encoders for language, vision, and cross-modal interaction. It is pre-trained on large image-and-sentence datasets such as MS COCO and Visual Genome, with tasks including masked language modeling, masked object prediction, cross-modality matching, and image question answering. The model's design, with dedicated cross-modality layers and shared weights between attention sub-layers, aims to efficiently fuse information from both modalities into a joint representation.
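
For intuition, here is a minimal, illustrative PyTorch sketch of one cross-modality layer. It is a sketch, not the repo's implementation: layer norms, dropout, and attention masks are omitted, and the single shared cross-attention module simply mirrors the shared-weights note above.

```python
# Illustrative sketch of one cross-modality layer (not the repo's code).
import torch
import torch.nn as nn

class CrossModalityLayer(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        # One cross-attention module reused for both directions (shared weights).
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visn_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.visn_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, lang, visn):
        # Bi-directional cross-attention: each modality attends to the other,
        # both directions computed from the pre-update inputs.
        lang2 = lang + self.cross_attn(lang, visn, visn)[0]
        visn2 = visn + self.cross_attn(visn, lang, lang)[0]
        lang, visn = lang2, visn2
        # Per-modality self-attention, then feed-forward, with residuals.
        lang = lang + self.lang_self(lang, lang, lang)[0]
        visn = visn + self.visn_self(visn, visn, visn)[0]
        return lang + self.lang_ffn(lang), visn + self.visn_ffn(visn)

# In the released lxr955 configuration, 5 such layers sit on top of
# 9 language-only and 5 vision-only (object-relationship) layers.
lang, visn = torch.randn(2, 20, 768), torch.randn(2, 36, 768)
lang, visn = CrossModalityLayer()(lang, visn)
```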

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Requires Python 3.
  • Pre-trained models are available for download.
  • Fine-tuning requires specific datasets (VQA, GQA, NLVR2) and corresponding Faster R-CNN features, which can be substantial (e.g., 17GB for VQA train features).
  • Pre-training requires significant computational resources (e.g., 4 GPUs for ~8.5 days).
  • Official quick-start and detailed fine-tuning instructions for VQA, GQA, and NLVR2 are provided.
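
For a quick smoke test without the repo's data pipeline, LXMERT also ships as a port in Hugging Face Transformers. A hedged sketch; the checkpoint name and dummy visual inputs below are assumptions, not taken from this README (real inputs need Faster R-CNN features):

```python
# Hedged smoke test via the Hugging Face Transformers port of LXMERT.
# Assumes `pip install transformers` and the "unc-nlp/lxmert-base-uncased" checkpoint.
import torch
from transformers import LxmertTokenizer, LxmertModel

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

inputs = tokenizer("Is the cat on the table?", return_tensors="pt")
visual_feats = torch.zeros(1, 36, 2048)  # dummy RoI features: 36 boxes x 2048-d
visual_pos = torch.zeros(1, 36, 4)       # dummy normalized box coordinates
out = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
print(out.language_output.shape, out.vision_output.shape)
```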

Highlighted Details

  • Achieved top-3 ranking in VQA 2019 and GQA 2019 challenges.
  • Provides pre-trained models (870MB) and fine-tuning scripts for multiple vision-and-language tasks.
  • Supports feature extraction via a Docker image for compatibility with the Bottom-Up Attention Caffe implementation (a reader for the resulting feature files is sketched after this list).
  • The released model architecture (lxr955) stacks 9 language, 5 cross-modality, and 5 object-relationship layers.
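
The Faster R-CNN features referenced above are distributed as Bottom-Up Attention TSV files. A hedged reader sketch, assuming the standard field layout of that format (base64-encoded float32 arrays); the repo's own loader may differ in details:

```python
# Hedged sketch of a Bottom-Up Attention TSV feature reader.
import base64, csv
import numpy as np

csv.field_size_limit(2**31 - 1)  # feature rows are very long
FIELDS = ["img_id", "img_h", "img_w", "objects_id", "objects_conf",
          "attrs_id", "attrs_conf", "num_boxes", "boxes", "features"]

def load_obj_tsv(path):
    """Yield (image id, boxes, features) per image from a feature TSV."""
    with open(path) as f:
        for row in csv.DictReader(f, FIELDS, delimiter="\t"):
            n = int(row["num_boxes"])
            boxes = np.frombuffer(base64.b64decode(row["boxes"]), np.float32).reshape(n, 4)
            feats = np.frombuffer(base64.b64decode(row["features"]), np.float32).reshape(n, -1)
            yield row["img_id"], boxes, feats
```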

Maintenance & Community

The project is associated with Hao Tan and Mohit Bansal. Acknowledgements mention support from ARO-YIP, Google, Facebook, Salesforce, and Adobe. No specific community channels (like Discord/Slack) are listed.

Licensing & Compatibility

The repository does not explicitly state a license in the README. The project acknowledges contributions and code from other sources, including Hugging Face Transformers and Bottom-Up-Attention.

Limitations & Caveats

  • The README notes occasional server issues that can slow downloads of the pre-trained models and features.
  • Pre-training requires substantial GPU resources and time.
  • Feature extraction depends on a specific Caffe build of Faster R-CNN, so the provided Docker image is the recommended route.
  • The released pre-trained model was trained for 12 epochs and scores slightly below the 20-epoch model reported in the paper.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 90 days
