METER by zdou0830

Multimodal framework for vision-and-language transformer research

created 3 years ago
373 stars

Top 77.1% on sourcepulse

View on GitHub
Project Summary

METER is a framework for training end-to-end vision-and-language transformers, offering pre-trained models and fine-tuning scripts for a range of multimodal tasks. It targets researchers and practitioners in computer vision and natural language processing who need strong multimodal understanding capabilities. The framework provides a unified approach to building and deploying multimodal models, lowering the effort required to reach state-of-the-art results.

How It Works

METER leverages a transformer-based architecture that processes both image and text inputs concurrently. It utilizes pre-trained vision encoders (like CLIP or Swin Transformer) and text encoders (like RoBERTa) and trains them jointly using masked language modeling (MLM) and image-text matching (ITM) objectives. This end-to-end approach allows for deeper fusion of visual and linguistic information, leading to improved performance on downstream tasks.
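At a high level, the fusion works like the sketch below: a minimal, illustrative PyTorch module (not the METER implementation) in which small stand-in encoders replace the pre-trained CLIP/Swin and RoBERTa backbones, text tokens cross-attend to image tokens, and ITM/MLM heads sit on top. All module names, layer counts, and dimensions here are assumptions for illustration.

```python
# Illustrative sketch only: two unimodal encoders feeding a cross-attention
# fusion step, with ITM and MLM heads on top. Names and sizes are made up.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=768, heads=12, vocab_size=50265):
        super().__init__()
        # Stand-ins for the pre-trained vision (CLIP/Swin) and text (RoBERTa) encoders.
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        # Co-attention: text tokens query the image tokens.
        self.text_to_image_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.itm_head = nn.Linear(dim, 2)            # image-text matched / not matched
        self.mlm_head = nn.Linear(dim, vocab_size)   # predict masked text tokens

    def forward(self, image_patches, text_embeddings):
        v = self.vision_encoder(image_patches)       # (B, num_patches, dim)
        t = self.text_encoder(text_embeddings)       # (B, seq_len, dim)
        fused, _ = self.text_to_image_attn(t, v, v)  # text queries, image keys/values
        itm_logits = self.itm_head(fused[:, 0])      # pooled first-token representation
        mlm_logits = self.mlm_head(fused)            # per-token vocabulary logits
        return itm_logits, mlm_logits

model = CrossModalFusion()
itm, mlm = model(torch.randn(2, 196, 768), torch.randn(2, 32, 768))
print(itm.shape, mlm.shape)  # torch.Size([2, 2]) torch.Size([2, 32, 50265])
```

The sketch shows only a single text-to-image attention step for brevity; the real fusion module is deeper, and the heads are driven by the MLM and ITM losses described above.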

Quick Start & Requirements

  • Install: pip install -r requirements.txt followed by pip install -e .
  • Prerequisites: Python, PyTorch, and the libraries listed in requirements.txt. Pre-trained checkpoints are available for download (a quick checkpoint-inspection sketch follows this list).
  • Resources: Requires significant GPU resources for pre-training and fine-tuning, with batch sizes dependent on GPU memory.
  • Links: Pre-trained Checkpoints
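As a quick sanity check after downloading, a checkpoint can be inspected as an ordinary PyTorch file. The filename below is a placeholder, and the assumption that weights are nested under a Lightning-style "state_dict" key comes from the ViLT lineage, not from the README:

```python
# Inspect a downloaded checkpoint as a plain PyTorch file.
# Placeholder path; the exact key layout is an assumption, not documented behavior.
import torch

ckpt = torch.load("meter_pretrained.ckpt", map_location="cpu")
# Lightning-style checkpoints typically nest the weights under "state_dict".
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state_dict)} tensors in checkpoint")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```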

Highlighted Details

  • Offers 8 pre-trained models, including versions fine-tuned on VQAv2, NLVR2, SNLI-VE, and Flickr30k/COCO for image-text retrieval.
  • Supports multiple image resolutions (224², 288², 384², and 576²) and encoder combinations (CLIP16 or SwinBase as the vision encoder, paired with RoBERTa for text).
  • Provides detailed scripts for pre-training and fine-tuning on multiple downstream tasks, including evaluation.
  • Utilizes pyarrow for dataset serialization, following the ViLT approach (see the sketch after this list).
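A rough sketch of what that ViLT-style serialization looks like with pyarrow is shown below; the column names and file name are illustrative assumptions, not the exact schema produced by the repository's conversion scripts:

```python
# Minimal sketch of ViLT-style Arrow serialization: one table per split,
# with raw image bytes and captions as columns. Field names are illustrative.
import pyarrow as pa

table = pa.Table.from_pydict({
    "image": [b"<raw jpeg bytes>", b"<raw jpeg bytes>"],
    "caption": [["a dog on a beach"], ["two people riding bikes"]],
    "image_id": ["coco_0001", "coco_0002"],
})

# Write the split to disk in the Arrow IPC (file) format.
with pa.OSFile("train_split.arrow", "wb") as sink:
    with pa.RecordBatchFileWriter(sink, table.schema) as writer:
        writer.write_table(table)

# At training time the file can be memory-mapped and read back lazily.
loaded = pa.ipc.open_file(pa.memory_map("train_split.arrow", "r")).read_all()
print(loaded.num_rows, loaded.column_names)
```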

Maintenance & Community

The project is associated with the CVPR 2022 paper "An Empirical Study of Training End-to-End Vision-and-Language Transformers." The code is based on ViLT (Apache 2.0 license) and incorporates elements from CLIP and Swin-Transformer.

Licensing & Compatibility

The project inherits the Apache 2.0 license from ViLT, with code borrowed from CLIP and Swin-Transformer. This license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The framework is primarily designed for large-scale distributed training, requiring substantial GPU resources and expertise in setting up distributed environments. The README does not explicitly detail support for single-GPU setups or provide specific performance benchmarks beyond task-specific results.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

  • open_flamingo by mlfoundations — Open-source framework for training large multimodal models (top 0.1% on sourcepulse; 4k stars; created 2 years ago, updated 11 months ago)