X-VLM by zengyan-97

Vision-language model for multi-grained alignment (ICML 2022 paper)

created 3 years ago
482 stars

Top 64.5% on sourcepulse

View on GitHub
Project Summary

X-VLM is a multi-grained vision-language pre-training framework designed to align text with visual concepts across images and videos. It targets researchers and practitioners in computer vision and natural language processing, offering a unified approach for various vision-language tasks.

How It Works

X-VLM aligns text with visual concepts at multiple granularities (objects, regions, and whole images) rather than only at the image level, letting it capture both fine-grained and global relationships between language and vision. It supports flexible backbone choices for the vision encoder (DeiT, CLIP-ViT, Swin Transformer) and the text encoder (BERT, RoBERTa), so state-of-the-art encoders can be swapped in. The framework supports distributed training with Apex O1/O2 mixed precision and can read and write data on HDFS.
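For intuition, the following is a minimal sketch of the kind of multi-grained contrastive alignment described above: visual concepts at different granularities are paired with texts of matching granularity and pulled together in a shared embedding space. It is an illustration only, not X-VLM's actual code; the feature dimensions, temperature value, and function names are assumptions.

    # Minimal sketch of multi-grained image-text contrastive alignment.
    # Illustration only -- not X-VLM's implementation; shapes, names, and
    # the temperature value are assumptions.
    import torch
    import torch.nn.functional as F

    def contrastive_alignment(visual_feats, text_feats, temperature=0.07):
        """Align N visual concepts (objects, regions, or whole images)
        with their N paired texts (labels, phrases, or captions)."""
        v = F.normalize(visual_feats, dim=-1)          # (N, D)
        t = F.normalize(text_feats, dim=-1)            # (N, D)
        logits = v @ t.T / temperature                 # (N, N) similarities
        targets = torch.arange(v.size(0))              # i-th concept <-> i-th text
        # Symmetric InfoNCE-style loss over both matching directions
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

    if __name__ == "__main__":
        # 8 "concepts" that could mix whole images, regions, and objects
        vis = torch.randn(8, 256)
        txt = torch.randn(8, 256)
        print(contrastive_alignment(vis, txt).item())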

Quick Start & Requirements

  • Install: pip3 install -r requirements.txt
  • Prerequisites: Python 3 environment, Apex (for pre-training), pre-trained vision and text encoder models (e.g., clip-vit-base, swin-transformer-base, bert-base-uncased), and pre-training datasets (COCO, VG, SBU, CC3M).
  • Setup: Requires downloading datasets and organizing them according to the specified directory structure. Pre-training can be initiated with python3 run.py --task "pretrain_4m_base" --dist "1" --output_dir "output/pretrain_4m_base" (a pre-flight sketch follows this list).
  • Links: Official PyTorch implementation
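As a rough pre-flight check before launching pre-training, the sketch below verifies the Python environment and a few expected resources, then starts the documented run.py command. The resource paths are placeholders, not the repository's exact layout.

    # Hedged pre-flight check before pre-training. The resource paths are
    # placeholders; adjust them to your own data and checkpoint layout.
    import importlib.util
    import os
    import subprocess
    import sys

    def importable(name):
        """Return True if a package can be imported in this environment."""
        return importlib.util.find_spec(name) is not None

    assert sys.version_info >= (3, 6), "the scripts expect Python 3"
    assert importable("torch"), "install the requirements first"
    if not importable("apex"):
        print("warning: NVIDIA Apex not found (needed for O1/O2 mixed-precision pre-training)")

    # Placeholder locations for the encoders and datasets named in the prerequisites.
    for path in ("data/bert-base-uncased", "data/swin-transformer-base", "images/coco"):
        if not os.path.exists(path):
            print(f"missing expected resource: {path}")

    # Launch pre-training with the documented command.
    subprocess.run([
        "python3", "run.py",
        "--task", "pretrain_4m_base",
        "--dist", "1",
        "--output_dir", "output/pretrain_4m_base",
    ], check=True)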

Highlighted Details

  • Supports image, video, and cross-lingual/domain transfer capabilities.
  • Offers checkpoints for models pre-trained on 4M and 16M images (a checkpoint-inspection sketch follows this list).
  • Provides fine-tuning scripts for tasks like retrieval, VQA, NLVR2, and captioning.
  • Includes evaluation scripts for the VLUE benchmark.
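Before fine-tuning, it can help to sanity-check a downloaded checkpoint. The sketch below loads one on CPU and lists a few parameter tensors; the filename and the "model" wrapper key are assumptions about the checkpoint format.

    # Inspect a downloaded pre-training checkpoint before fine-tuning.
    # The filename and the "model" key are assumptions; adapt them to the
    # checkpoint you actually downloaded.
    import torch

    ckpt = torch.load("x_vlm_4m_base.th", map_location="cpu")
    state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
    print(f"{len(state)} parameter tensors")
    for name, tensor in list(state.items())[:5]:
        print(f"  {name}: {tuple(tensor.shape)}")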

Maintenance & Community

The project was released by ByteDance AI-LAB. Contact is available via GitHub issues.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or closed-source integration.

Limitations & Caveats

The README does not specify a license, which may pose a barrier to commercial adoption. Users are responsible for preparing their own pre-training data as the project cannot redistribute it.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 90 days
