PyTorch code for cross-modality representation learning via Transformers
This repository provides the PyTorch implementation of LXMERT (Learning Cross-Modality Encoder Representations from Transformers). It targets researchers and practitioners working on vision-and-language tasks such as Visual Question Answering (VQA), GQA, and NLVR2, and ships pre-trained models plus fine-tuning scripts for achieving state-of-the-art results on these benchmarks.
How It Works
LXMERT utilizes a Transformer-based architecture with separate encoders for language, vision, and cross-modal interactions. It employs a multi-modal pre-training strategy on large datasets like MS COCO and Visual Genome, incorporating tasks such as masked language modeling, object prediction, and visual question answering. The model's design, with dedicated cross-modality layers and shared weights between attention sub-layers, aims to efficiently fuse information from both modalities into a joint representation.
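To make the three-encoder layout concrete, below is a minimal PyTorch sketch, not the repository's actual code: the layer counts (9 language, 5 object/vision, 5 cross-modality) follow the paper, but the module names, the 2048-d region-feature projection, and the reuse of a single cross-attention module for both directions (standing in for the shared-attention design mentioned above) are illustrative assumptions. Positional and bounding-box embeddings, attention masks, and the pre-training heads are omitted.

import torch
import torch.nn as nn


class CrossModalityLayer(nn.Module):
    """Schematic LXMERT-style layer: bi-directional cross-attention, then
    per-modality self-attention and feed-forward sub-layers (illustrative only)."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        # One cross-attention module applied in both directions (language->vision
        # and vision->language), mirroring the shared-weights idea above.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.ModuleDict(
            {m: nn.MultiheadAttention(dim, heads, batch_first=True) for m in ("lang", "vis")})
        self.ffn = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for m in ("lang", "vis")})
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(6))

    def forward(self, lang, vis):
        # Cross-attention: each modality queries the other, using the pre-update states.
        lang_ctx = self.cross_attn(lang, vis, vis, need_weights=False)[0]
        vis_ctx = self.cross_attn(vis, lang, lang, need_weights=False)[0]
        lang = self.norms[0](lang + lang_ctx)
        vis = self.norms[1](vis + vis_ctx)
        # Per-modality self-attention.
        lang = self.norms[2](lang + self.self_attn["lang"](lang, lang, lang, need_weights=False)[0])
        vis = self.norms[3](vis + self.self_attn["vis"](vis, vis, vis, need_weights=False)[0])
        # Per-modality feed-forward.
        return self.norms[4](lang + self.ffn["lang"](lang)), self.norms[5](vis + self.ffn["vis"](vis))


class LxmertSketch(nn.Module):
    """Toy encoder stack: language-only layers, vision-only layers, then cross-modality layers."""

    def __init__(self, dim=768, heads=12, n_lang=9, n_vis=5, n_cross=5):
        super().__init__()
        self.lang_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True), n_lang)
        self.vis_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True), n_vis)
        self.cross_layers = nn.ModuleList([CrossModalityLayer(dim, heads) for _ in range(n_cross)])
        self.vis_proj = nn.Linear(2048, dim)  # project detector region features to model width

    def forward(self, lang_embeds, region_feats):
        lang = self.lang_encoder(lang_embeds)                 # (batch, seq_len, dim)
        vis = self.vis_encoder(self.vis_proj(region_feats))   # (batch, num_regions, dim)
        for layer in self.cross_layers:
            lang, vis = layer(lang, vis)
        return lang, vis  # joint representations fed to downstream heads (VQA, NLVR2, ...)

A real forward pass would additionally add word and position embeddings on the language side, box-coordinate embeddings on the vision side, and the pre-training heads (masked language modeling, object prediction, cross-modality matching, and visual question answering) described above.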
Quick Start & Requirements
pip install -r requirements.txt
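Beyond installing the dependencies, a quick way to sanity-check the model is through the Hugging Face Transformers port of LXMERT (credited under Licensing & Compatibility below) rather than this repository's own scripts. The sketch below is a hedged example: the unc-nlp/lxmert-base-uncased checkpoint name and output field names follow the Transformers port, and the random tensors are placeholders for real Faster R-CNN region features.

import torch
from transformers import LxmertTokenizer, LxmertModel

# Load the published pre-trained checkpoint via the Hugging Face port.
tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

inputs = tokenizer("What color is the cat?", return_tensors="pt")

# LXMERT expects pre-extracted region features (normally from Faster R-CNN):
# here, 36 regions with 2048-d features and normalized bounding boxes.
visual_feats = torch.randn(1, 36, 2048)  # random stand-in for real detector features
visual_pos = torch.rand(1, 36, 4)        # random stand-in for box coordinates

outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
print(outputs.language_output.shape)  # (1, seq_len, 768)
print(outputs.vision_output.shape)    # (1, 36, 768)
print(outputs.pooled_output.shape)    # (1, 768) joint representation for classification heads

For pre-training and fine-tuning on VQA, GQA, and NLVR2, follow the repository's own scripts and data-preparation instructions, which rely on features extracted with Bottom-Up-Attention.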
Highlighted Details
Maintenance & Community
The project is authored by Hao Tan and Mohit Bansal. The acknowledgements credit support from ARO-YIP, Google, Facebook, Salesforce, and Adobe. No community channels (such as Discord or Slack) are listed.
Licensing & Compatibility
The repository does not explicitly state a license in the README. The project acknowledges contributions and code from other sources, including Hugging Face Transformers and Bottom-Up-Attention.
Limitations & Caveats
The repository was last updated about two years ago and appears to be inactive, so newer versions of PyTorch and its other dependencies may not be supported.