METER by zdou0830

Multimodal framework for vision-and-language transformer research

Created 3 years ago
373 stars

Top 76.0% on SourcePulse

View on GitHub
Project Summary

METER is a framework for training end-to-end vision-and-language transformers, offering pre-trained models and fine-tuning scripts for a range of multimodal tasks. It targets researchers and practitioners in computer vision and natural language processing who need robust multimodal understanding capabilities. The framework provides a single pipeline for pre-training, fine-tuning, and evaluating multimodal models, making state-of-the-art results easier to reproduce.

How It Works

METER uses a transformer-based architecture that encodes images and text with separate pre-trained encoders: a vision backbone (such as CLIP-ViT or Swin Transformer) and a text backbone (such as RoBERTa). The two streams are fused through cross-modal (co-attention) transformer layers, and the whole model is trained end-to-end with masked language modeling (MLM) and image-text matching (ITM) objectives. Training the encoders and fusion layers jointly allows deeper integration of visual and linguistic information, which improves performance on downstream tasks.
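A minimal PyTorch sketch of this co-attention fusion pattern with MLM and ITM heads is shown below. It is illustrative only, not METER's actual implementation: the class names, hidden dimensions, and the stand-in "encoders" (an embedding table and a linear projection over random features) are assumptions made for brevity.

```python
# Illustrative sketch of co-attention fusion with MLM and ITM heads (not the official METER code).
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """One fusion layer: self-attention, then cross-attention over the other modality."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, context):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.norm2(x), context, context)[0]
        return x + self.ffn(self.norm3(x))


class ToyMETER(nn.Module):
    """Tiny stand-in: toy 'encoders' + symmetric co-attention fusion + MLM/ITM heads."""

    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)  # stands in for a RoBERTa encoder
        self.image_proj = nn.Linear(768, dim)            # stands in for CLIP/Swin patch features
        self.text_block = CrossModalBlock(dim)
        self.image_block = CrossModalBlock(dim)
        self.mlm_head = nn.Linear(dim, vocab_size)       # masked language modeling
        self.itm_head = nn.Linear(dim, 2)                # image-text matching: match / no match

    def forward(self, token_ids, image_feats):
        t = self.text_embed(token_ids)
        v = self.image_proj(image_feats)
        t, v = self.text_block(t, v), self.image_block(v, t)  # each stream attends to the other
        return self.mlm_head(t), self.itm_head(t[:, 0])       # ITM read from the first text token


model = ToyMETER()
tokens = torch.randint(0, 1000, (2, 16))  # fake token ids: batch of 2, 16 tokens each
patches = torch.randn(2, 49, 768)         # fake patch features, e.g. a 7x7 grid per image
mlm_logits, itm_logits = model(tokens, patches)
print(mlm_logits.shape, itm_logits.shape)  # torch.Size([2, 16, 1000]) torch.Size([2, 2])
```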

Quick Start & Requirements

  • Install: pip install -r requirements.txt followed by pip install -e .
  • Prerequisites: Python, PyTorch, and the libraries listed in requirements.txt. Pre-trained checkpoints are available for download (a quick loading sanity check is sketched after this list).
  • Resources: Requires significant GPU resources for pre-training and fine-tuning, with batch sizes dependent on GPU memory.
  • Links: Pre-trained Checkpoints
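Once a checkpoint is downloaded, a quick check like the one below confirms it loads and shows what it contains. This is a generic PyTorch sketch: the filename is a placeholder, and the assumption that the weights sit under a "state_dict" key (common for PyTorch Lightning checkpoints) may not hold for every file.

```python
# Sanity-check a downloaded checkpoint before fine-tuning.
# The filename below is a placeholder; substitute whichever checkpoint you downloaded.
import torch

ckpt = torch.load("meter_pretrain.ckpt", map_location="cpu")

# Lightning-style checkpoints usually nest weights under "state_dict";
# otherwise treat the loaded object itself as the state dict.
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

print(f"{len(state)} tensors")
for name in list(state)[:5]:  # peek at the first few parameter names and shapes
    print(f"  {name}: {tuple(state[name].shape)}")
```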

Highlighted Details

  • Offers 8 pre-trained models, including versions fine-tuned on VQAv2, NLVR2, SNLI-VE, and Flickr30k/COCO for image-text retrieval.
  • Supports multiple image resolutions (224², 288², 384², 576²) and encoder combinations (CLIP-ViT-B/16 or Swin Base vision encoders paired with RoBERTa).
  • Provides detailed scripts for pre-training and fine-tuning on multiple downstream tasks, including evaluation.
  • Utilizes pyarrow for dataset serialization, following the ViLT approach (see the read-back sketch after this list).
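With this serialization scheme, each dataset split is stored as an Arrow file that can be memory-mapped and read back as a table. The snippet below is a hedged sketch of that pattern; the file name and the "caption" column are assumptions about the ViLT-style schema, not a guaranteed layout.

```python
# Read a ViLT-style Arrow shard back as a table.
# The file name and the "caption" column are illustrative assumptions.
import pyarrow as pa

with pa.memory_map("coco_caption_karpathy_train.arrow", "r") as source:
    table = pa.ipc.RecordBatchFileReader(source).read_all()

print(table.num_rows, table.column_names)
caption = table["caption"][0].as_py()  # assumes a "caption" column exists
print(caption)
```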

Maintenance & Community

The project is associated with the CVPR 2022 paper "An Empirical Study of Training End-to-End Vision-and-Language Transformers." The code is based on ViLT (Apache 2.0 license) and incorporates elements from CLIP and Swin-Transformer.

Licensing & Compatibility

The project inherits the Apache 2.0 license from ViLT, with code borrowed from CLIP and Swin-Transformer. This license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The framework is primarily designed for large-scale distributed training, requiring substantial GPU resources and expertise in setting up distributed environments. The README does not explicitly detail support for single-GPU setups or provide specific performance benchmarks beyond task-specific results.

Health Check

Last Commit: 2 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

0.1% · 4k stars
Open-source framework for training large multimodal models
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

0.1% · 4k stars
Any-to-any multimodal LLM research paper
Created 2 years ago
Updated 4 months ago