lxmert by airsplay

PyTorch code for cross-modality representation learning via Transformers

Created 6 years ago
966 stars

Top 38.0% on SourcePulse

Project Summary

This repository provides the PyTorch implementation of LXMERT (Learning Cross-Modality Encoder Representations from Transformers). It is aimed at researchers and practitioners working on vision-and-language tasks such as Visual Question Answering (VQA), GQA, and NLVR2, and offers pre-trained models plus fine-tuning scripts that achieved state-of-the-art results at the time of release.

How It Works

LXMERT uses a Transformer-based architecture with separate encoders for language, vision, and cross-modal interaction. It is pre-trained on large datasets such as MS COCO and Visual Genome with multiple objectives: masked language modeling, masked object prediction, cross-modality matching, and image question answering. Dedicated cross-modality layers, with weights shared between their two cross-attention sub-layers, fuse information from both modalities into a joint representation.
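
To make the data flow concrete, here is a minimal PyTorch sketch of the two-stream design. It is an illustration, not the repository's actual code: module names such as CrossModalityLayer and LxmertSketch are invented for this example, and the real implementation (kept under the repo's src/lxrt/ package) adds embeddings, masking heads, pooling, and the weight sharing mentioned above.

    import torch
    import torch.nn as nn

    class CrossModalityLayer(nn.Module):
        """One cross-modality layer (simplified): bidirectional cross-attention,
        then a per-modality self-attention + feed-forward block."""
        def __init__(self, dim=768, heads=12):
            super().__init__()
            self.lang_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.vis_to_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.lang_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.vis_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

        def forward(self, lang, vis):
            # Queries come from one modality, keys/values from the other.
            lang_ctx, _ = self.lang_to_vis(lang, vis, vis)
            vis_ctx, _ = self.vis_to_lang(vis, lang, lang)
            # Residual fusion, then each stream's own self-attention + FFN.
            return self.lang_block(lang + lang_ctx), self.vis_block(vis + vis_ctx)

    class LxmertSketch(nn.Module):
        """lxr955-style stack: 9 language layers and 5 object-relationship
        layers run independently, then 5 cross-modality layers fuse them."""
        def __init__(self, dim=768, heads=12):
            super().__init__()
            enc = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.lang_layers = nn.ModuleList(enc() for _ in range(9))
            self.obj_layers = nn.ModuleList(enc() for _ in range(5))
            self.cross_layers = nn.ModuleList(
                CrossModalityLayer(dim, heads) for _ in range(5))
            # Project each RoI's 2048-d feature plus 4-d box into the hidden size.
            self.vis_proj = nn.Linear(2048 + 4, dim)

        def forward(self, word_embeds, roi_feats, boxes):
            lang = word_embeds
            vis = self.vis_proj(torch.cat([roi_feats, boxes], dim=-1))
            for layer in self.lang_layers:
                lang = layer(lang)
            for layer in self.obj_layers:
                vis = layer(vis)
            for layer in self.cross_layers:
                lang, vis = layer(lang, vis)
            return lang, vis  # joint cross-modal representations

In the actual model, the box coordinates go through their own embedding rather than the simple concatenation used here, and the two cross-attention directions share parameters.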

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Requires Python 3.
  • Pre-trained models are available for download; a minimal loading sketch follows this list.
  • Fine-tuning requires specific datasets (VQA, GQA, NLVR2) and corresponding Faster R-CNN features, which can be substantial (e.g., 17GB for VQA train features).
  • Pre-training requires significant computational resources (e.g., 4 GPUs for ~8.5 days).
  • Official quick-start and detailed fine-tuning instructions for VQA, GQA, and NLVR2 are provided.
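
For a quick smoke test without the repository's own download scripts, LXMERT is also available through the Hugging Face Transformers library, which hosts a port of the weights as unc-nlp/lxmert-base-uncased. The sketch below is an alternative route, not this repository's workflow; it assumes the transformers package is installed, and the random tensors stand in for the Faster R-CNN features the model actually expects.

    # Minimal inference sketch via the Hugging Face Transformers port of LXMERT.
    # Assumes: pip install transformers torch. The random visual inputs below are
    # placeholders; real use requires Faster R-CNN RoI features and boxes.
    import torch
    from transformers import LxmertModel, LxmertTokenizer

    tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
    model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

    inputs = tokenizer("What color is the cat?", return_tensors="pt")
    num_objects = 36                                 # typical RoI count per image
    visual_feats = torch.rand(1, num_objects, 2048)  # 2048-d RoI features
    visual_pos = torch.rand(1, num_objects, 4)       # normalized box coordinates

    outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
    print(outputs.language_output.shape, outputs.vision_output.shape)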

Highlighted Details

  • Achieved top-3 ranking in VQA 2019 and GQA 2019 challenges.
  • Provides pre-trained models (870MB) and fine-tuning scripts for multiple vision-and-language tasks.
  • Supports feature extraction via a Docker image for compatibility with the Bottom-Up Attention Caffe implementation.
  • The lxr955 model architecture comprises 9 language layers, 5 cross-modality layers, and 5 object-relationship layers; the naming scheme is unpacked in the sketch below.
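
The lxr955 tag simply encodes those three layer counts. A tiny illustrative helper (the dictionary keys here are invented for the example; the repository configures the counts through its training scripts):

    # "lxr955" = l(anguage)=9, x(cross-modality)=5, r(elationship)=5 layers.
    # Hypothetical helper for illustration; not part of the repository.
    LXR955 = {"language": 9, "cross_modality": 5, "object_relationship": 5}

    def model_tag(cfg: dict) -> str:
        return "lxr{language}{cross_modality}{object_relationship}".format(**cfg)

    assert model_tag(LXR955) == "lxr955"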

Maintenance & Community

The project was developed by Hao Tan and Mohit Bansal (UNC Chapel Hill). The acknowledgements credit support from ARO-YIP, Google, Facebook, Salesforce, and Adobe. No community channels (such as Discord or Slack) are listed.

Licensing & Compatibility

The repository does not explicitly state a license in the README. The project acknowledges contributions and code from other sources, including Hugging Face Transformers and Bottom-Up-Attention.

Limitations & Caveats

  • The README notes intermittent server issues that may slow downloads of models and features.
  • Pre-training requires substantial GPU resources and time.
  • Feature extraction depends on a specific Caffe build of Faster R-CNN, so the provided Docker image is the practical way to run it.
  • The provided pre-trained model is trained for 12 epochs, resulting in slightly lower downstream performance compared to the 20-epoch model mentioned in the paper.
Health Check

  • Last Commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

Created 3 years ago
Updated 1 year ago
4k stars
Top 0.1% on SourcePulse