lxmert by airsplay

PyTorch code for cross-modality representation learning via Transformers

Created 6 years ago
966 stars

Top 38.0% on SourcePulse

Project Summary

This repository provides the PyTorch implementation of LXMERT (Learning Cross-Modality Encoder Representations from Transformers). It is aimed at researchers and practitioners working on vision-and-language tasks such as Visual Question Answering (VQA), GQA, and NLVR2, and offers pre-trained models plus fine-tuning scripts that achieved state-of-the-art results at the time of release.

How It Works

LXMERT uses a Transformer-based architecture with separate encoders for language, vision, and cross-modal interaction. It is pre-trained on large datasets such as MS COCO and Visual Genome with multiple objectives: masked language modeling, masked object prediction, cross-modality matching, and image question answering. Dedicated cross-modality layers, with weights shared between their two cross-attention sub-layers, fuse information from both modalities into a joint representation.
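
To make the data flow concrete, here is a minimal PyTorch sketch of the two-stream design. It is an illustration, not the repository's actual code: module names such as CrossModalityLayer and LxmertSketch are invented for this example, and the real implementation (kept under the repo's src/lxrt/ package) adds embeddings, masking heads, pooling, and the weight sharing mentioned above.

    import torch
    import torch.nn as nn

    class CrossModalityLayer(nn.Module):
        """One cross-modality layer (simplified): bidirectional cross-attention,
        then a per-modality self-attention + feed-forward block."""
        def __init__(self, dim=768, heads=12):
            super().__init__()
            self.lang_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.vis_to_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.lang_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.vis_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

        def forward(self, lang, vis):
            # Queries come from one modality, keys/values from the other.
            lang_ctx, _ = self.lang_to_vis(lang, vis, vis)
            vis_ctx, _ = self.vis_to_lang(vis, lang, lang)
            # Residual fusion, then each stream's own self-attention + FFN.
            return self.lang_block(lang + lang_ctx), self.vis_block(vis + vis_ctx)

    class LxmertSketch(nn.Module):
        """lxr955-style stack: 9 language layers and 5 object-relationship
        layers run independently, then 5 cross-modality layers fuse them."""
        def __init__(self, dim=768, heads=12):
            super().__init__()
            enc = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.lang_layers = nn.ModuleList(enc() for _ in range(9))
            self.obj_layers = nn.ModuleList(enc() for _ in range(5))
            self.cross_layers = nn.ModuleList(
                CrossModalityLayer(dim, heads) for _ in range(5))
            # Project each RoI's 2048-d feature plus 4-d box into the hidden size.
            self.vis_proj = nn.Linear(2048 + 4, dim)

        def forward(self, word_embeds, roi_feats, boxes):
            lang = word_embeds
            vis = self.vis_proj(torch.cat([roi_feats, boxes], dim=-1))
            for layer in self.lang_layers:
                lang = layer(lang)
            for layer in self.obj_layers:
                vis = layer(vis)
            for layer in self.cross_layers:
                lang, vis = layer(lang, vis)
            return lang, vis  # joint cross-modal representations

In the actual model, the box coordinates go through their own embedding rather than the simple concatenation used here, and the two cross-attention directions share parameters.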

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Requires Python 3.
  • Pre-trained models are available for download; a minimal loading sketch follows this list.
  • Fine-tuning requires specific datasets (VQA, GQA, NLVR2) and corresponding Faster R-CNN features, which can be substantial (e.g., 17GB for VQA train features).
  • Pre-training requires significant computational resources (e.g., 4 GPUs for ~8.5 days).
  • Official quick-start and detailed fine-tuning instructions for VQA, GQA, and NLVR2 are provided.
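
For a quick smoke test without the repository's own download scripts, LXMERT is also available through the Hugging Face Transformers library, which hosts a port of the weights as unc-nlp/lxmert-base-uncased. The sketch below is an alternative route, not this repository's workflow; it assumes the transformers package is installed, and the random tensors stand in for the Faster R-CNN features the model actually expects.

    # Minimal inference sketch via the Hugging Face Transformers port of LXMERT.
    # Assumes: pip install transformers torch. The random visual inputs below are
    # placeholders; real use requires Faster R-CNN RoI features and boxes.
    import torch
    from transformers import LxmertModel, LxmertTokenizer

    tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
    model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

    inputs = tokenizer("What color is the cat?", return_tensors="pt")
    num_objects = 36                                 # typical RoI count per image
    visual_feats = torch.rand(1, num_objects, 2048)  # 2048-d RoI features
    visual_pos = torch.rand(1, num_objects, 4)       # normalized box coordinates

    outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
    print(outputs.language_output.shape, outputs.vision_output.shape)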

Highlighted Details

  • Achieved top-3 ranking in VQA 2019 and GQA 2019 challenges.
  • Provides pre-trained models (870MB) and fine-tuning scripts for multiple vision-and-language tasks.
  • Supports feature extraction via a Docker image for compatibility with the Bottom-Up Attention Caffe implementation.
  • The lxr955 model architecture comprises 9 language layers, 5 cross-modality layers, and 5 object-relationship layers; the naming scheme is unpacked in the sketch below.
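
The lxr955 tag simply encodes those three layer counts. A tiny illustrative helper (the dictionary keys here are invented for the example; the repository configures the counts through its training scripts):

    # "lxr955" = l(anguage)=9, x(cross-modality)=5, r(elationship)=5 layers.
    # Hypothetical helper for illustration; not part of the repository.
    LXR955 = {"language": 9, "cross_modality": 5, "object_relationship": 5}

    def model_tag(cfg: dict) -> str:
        return "lxr{language}{cross_modality}{object_relationship}".format(**cfg)

    assert model_tag(LXR955) == "lxr955"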

Maintenance & Community

The project was developed by Hao Tan and Mohit Bansal (UNC Chapel Hill). The acknowledgements credit support from ARO-YIP, Google, Facebook, Salesforce, and Adobe. No community channels (such as Discord or Slack) are listed.

Licensing & Compatibility

The repository does not explicitly state a license in the README. The project acknowledges contributions and code from other sources, including Hugging Face Transformers and Bottom-Up-Attention.

Limitations & Caveats

  • The README notes intermittent server issues that may slow downloads of models and features.
  • Pre-training requires substantial GPU resources and time.
  • Feature extraction depends on a specific Caffe build of Faster R-CNN, so the provided Docker image is the practical way to run it.
  • The provided pre-trained model is trained for 12 epochs, resulting in slightly lower downstream performance compared to the 20-epoch model mentioned in the paper.
Health Check

  • Last Commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

Created 3 years ago
Updated 1 year ago
4k stars
Top 0.1% on SourcePulse