Multimodal framework for vision-and-language transformer research
METER is a framework for training end-to-end vision-and-language transformers, offering pre-trained models and fine-tuning scripts for various multimodal tasks. It targets researchers and practitioners in computer vision and natural language processing who need robust multimodal understanding capabilities. The framework provides a unified codebase for building multimodal models and reproducing strong results on downstream tasks such as visual question answering and image-text retrieval.
How It Works
METER uses a transformer-based architecture that processes image and text inputs concurrently. It pairs a pre-trained vision encoder (such as CLIP-ViT or Swin Transformer) with a pre-trained text encoder (such as RoBERTa), fuses their outputs in a transformer-based co-attention module, and trains the whole stack jointly with masked language modeling (MLM) and image-text matching (ITM) objectives. This end-to-end approach allows deeper fusion of visual and linguistic information, improving performance on downstream tasks.
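To make the architecture concrete, the sketch below shows a minimal METER-style model in PyTorch: two unimodal encoders, a stack of co-attention fusion layers in which each modality cross-attends to the other, and MLM/ITM heads on top. This is not the actual METER code; the tiny stand-in encoders replace the pre-trained CLIP/Swin and RoBERTa backbones, and all dimensions, class names, and layer counts are illustrative assumptions.

import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One fusion layer: each modality cross-attends to the other."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_i = nn.LayerNorm(dim)

    def forward(self, txt, img):
        t, _ = self.txt_attends_img(txt, img, img)   # text queries image keys/values
        i, _ = self.img_attends_txt(img, txt, txt)   # image queries text keys/values
        return self.norm_t(txt + t), self.norm_i(img + i)

class MeterStyleModel(nn.Module):
    def __init__(self, vocab_size=50265, dim=256, fusion_layers=2):
        super().__init__()
        # Stand-ins for the pre-trained RoBERTa / CLIP-ViT (or Swin) backbones.
        self.text_encoder = nn.Embedding(vocab_size, dim)
        self.image_encoder = nn.Linear(768, dim)      # assumes pre-extracted patch features
        self.fusion = nn.ModuleList(CoAttentionBlock(dim) for _ in range(fusion_layers))
        self.mlm_head = nn.Linear(dim, vocab_size)    # masked language modeling
        self.itm_head = nn.Linear(dim, 2)             # image-text matching (match / no match)

    def forward(self, token_ids, patch_feats):
        txt = self.text_encoder(token_ids)            # (B, T, dim)
        img = self.image_encoder(patch_feats)         # (B, P, dim)
        for block in self.fusion:
            txt, img = block(txt, img)
        mlm_logits = self.mlm_head(txt)               # predict masked tokens
        itm_logits = self.itm_head(txt[:, 0])         # pooled first text token
        return mlm_logits, itm_logits

# Smoke test with random inputs.
model = MeterStyleModel()
mlm, itm = model(torch.randint(0, 50265, (2, 16)), torch.randn(2, 49, 768))
print(mlm.shape, itm.shape)   # torch.Size([2, 16, 50265]) torch.Size([2, 2])

During pre-training, the MLM logits are scored against the masked-out tokens and the ITM logits against a binary matched/mismatched label; fine-tuning replaces or supplements these heads with task-specific ones.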
Quick Start & Requirements
Install the dependencies with pip install -r requirements.txt, then install the package in editable mode with pip install -e . (required packages are listed in requirements.txt). Pre-trained checkpoints are available for download.
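After downloading a checkpoint, a quick sanity check like the following can confirm the file loads and list a few of its weights. The filename meter_pretrain.ckpt is a placeholder, and the "state_dict" nesting is an assumption based on PyTorch-Lightning conventions (METER builds on ViLT, which uses Lightning); the actual checkpoint layout may differ.

import torch

# Hypothetical filename; substitute the checkpoint you downloaded.
ckpt = torch.load("meter_pretrain.ckpt", map_location="cpu")

# Lightning-style checkpoints usually nest weights under "state_dict";
# otherwise treat the loaded object as the state dict itself.
state_dict = ckpt.get("state_dict", ckpt)
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))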
Maintenance & Community
The project accompanies the CVPR 2022 paper "An Empirical Study of Training End-to-End Vision-and-Language Transformers." The code is based on ViLT (Apache 2.0 license) and incorporates components from CLIP and Swin-Transformer. The repository is marked inactive, with its last update roughly two years ago.
Licensing & Compatibility
The project inherits the Apache 2.0 license from ViLT, with code borrowed from CLIP and Swin-Transformer. This license is permissive and generally compatible with commercial use and closed-source linking.
Limitations & Caveats
The framework is primarily designed for large-scale distributed training, requiring substantial GPU resources and expertise in setting up distributed environments. The README does not explicitly detail support for single-GPU setups or provide specific performance benchmarks beyond task-specific results.