Multimodal framework for vision-and-language transformer research
METER is a framework for training end-to-end vision-and-language transformers, offering pre-trained models and fine-tuning scripts for various multimodal tasks. It targets researchers and practitioners in computer vision and natural language processing who need robust multimodal understanding capabilities. The framework provides a unified codebase for building multimodal models and reproducing strong results on downstream tasks such as visual question answering and image-text retrieval.
How It Works
METER uses a transformer-based architecture that processes image and text inputs concurrently. It pairs a pre-trained vision encoder (such as CLIP-ViT or Swin Transformer) with a pre-trained text encoder (such as RoBERTa), fuses their outputs in a transformer-based co-attention module, and trains the whole stack jointly with masked language modeling (MLM) and image-text matching (ITM) objectives. This end-to-end approach allows deeper fusion of visual and linguistic information, improving performance on downstream tasks.
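To make the architecture concrete, the sketch below shows a minimal METER-style model in PyTorch: two unimodal encoders, a stack of co-attention fusion layers in which each modality cross-attends to the other, and MLM/ITM heads on top. This is not the actual METER code; the tiny stand-in encoders replace the pre-trained CLIP/Swin and RoBERTa backbones, and all dimensions, class names, and layer counts are illustrative assumptions.

import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One fusion layer: each modality cross-attends to the other."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_i = nn.LayerNorm(dim)

    def forward(self, txt, img):
        t, _ = self.txt_attends_img(txt, img, img)   # text queries image keys/values
        i, _ = self.img_attends_txt(img, txt, txt)   # image queries text keys/values
        return self.norm_t(txt + t), self.norm_i(img + i)

class MeterStyleModel(nn.Module):
    def __init__(self, vocab_size=50265, dim=256, fusion_layers=2):
        super().__init__()
        # Stand-ins for the pre-trained RoBERTa / CLIP-ViT (or Swin) backbones.
        self.text_encoder = nn.Embedding(vocab_size, dim)
        self.image_encoder = nn.Linear(768, dim)      # assumes pre-extracted patch features
        self.fusion = nn.ModuleList(CoAttentionBlock(dim) for _ in range(fusion_layers))
        self.mlm_head = nn.Linear(dim, vocab_size)    # masked language modeling
        self.itm_head = nn.Linear(dim, 2)             # image-text matching (match / no match)

    def forward(self, token_ids, patch_feats):
        txt = self.text_encoder(token_ids)            # (B, T, dim)
        img = self.image_encoder(patch_feats)         # (B, P, dim)
        for block in self.fusion:
            txt, img = block(txt, img)
        mlm_logits = self.mlm_head(txt)               # predict masked tokens
        itm_logits = self.itm_head(txt[:, 0])         # pooled first text token
        return mlm_logits, itm_logits

# Smoke test with random inputs.
model = MeterStyleModel()
mlm, itm = model(torch.randint(0, 50265, (2, 16)), torch.randn(2, 49, 768))
print(mlm.shape, itm.shape)   # torch.Size([2, 16, 50265]) torch.Size([2, 2])

During pre-training, the MLM logits are scored against the masked-out tokens and the ITM logits against a binary matched/mismatched label; fine-tuning replaces or supplements these heads with task-specific ones.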
Quick Start & Requirements
Install the dependencies with pip install -r requirements.txt, then install the package in editable mode with pip install -e . (required packages are listed in requirements.txt). Pre-trained checkpoints are available for download.
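After downloading a checkpoint, a quick sanity check like the following can confirm the file loads and list a few of its weights. The filename meter_pretrain.ckpt is a placeholder, and the "state_dict" nesting is an assumption based on PyTorch-Lightning conventions (METER builds on ViLT, which uses Lightning); the actual checkpoint layout may differ.

import torch

# Hypothetical filename; substitute the checkpoint you downloaded.
ckpt = torch.load("meter_pretrain.ckpt", map_location="cpu")

# Lightning-style checkpoints usually nest weights under "state_dict";
# otherwise treat the loaded object as the state dict itself.
state_dict = ckpt.get("state_dict", ckpt)
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))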
Maintenance & Community
The project accompanies the CVPR 2022 paper "An Empirical Study of Training End-to-End Vision-and-Language Transformers." The code is based on ViLT (Apache 2.0 license) and incorporates components from CLIP and Swin-Transformer. The repository is marked inactive, with its last update roughly two years ago.
Licensing & Compatibility
The project inherits the Apache 2.0 license from ViLT, with code borrowed from CLIP and Swin-Transformer. This license is permissive and generally compatible with commercial use and closed-source linking.
Limitations & Caveats
The framework is primarily designed for large-scale distributed training, requiring substantial GPU resources and expertise in setting up distributed environments. The README does not explicitly detail support for single-GPU setups or provide specific performance benchmarks beyond task-specific results.