X-VLM by zengyan-97

Vision-language model for multi-grained alignment (ICML 2022 paper)

Created 3 years ago
484 stars

Top 63.4% on SourcePulse

Project Summary

X-VLM is a multi-grained vision-language pre-training framework designed to align text with visual concepts across images and videos. It targets researchers and practitioners in computer vision and natural language processing, offering a unified approach for various vision-language tasks.

How It Works

X-VLM employs a multi-grained alignment strategy, enabling it to capture relationships between text and visual concepts at multiple scales (objects, regions, and whole images). It supports flexible backbone choices for both vision (DeiT, CLIP-ViT, Swin Transformer) and text (BERT, RoBERTa) encoders, allowing customization and the use of state-of-the-art models. The framework is optimized for distributed training with NVIDIA Apex mixed precision (O1/O2) and can read from and write to HDFS.

Quick Start & Requirements

  • Install: pip3 install -r requirements.txt
  • Prerequisites: Python 3 environment, Apex (for pre-training), pre-trained vision and text encoder models (e.g., clip-vit-base, swin-transformer-base, bert-base-uncased), and pre-training datasets (COCO, VG, SBU, CC3M).
  • Setup: Requires downloading datasets and organizing them according to the specified directory structure. Pre-training can be initiated with python3 run.py --task "pretrain_4m_base" --dist "1" --output_dir "output/pretrain_4m_base".
  • Links: Official PyTorch implementation
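Tying the steps above together, a minimal launch sketch. Only the pip and run.py commands come from the project; the variables and the dry-run echo are illustrative scaffolding, and the prerequisites (Apex, encoder checkpoints, prepared datasets) must already be in place before the launch command will actually work:

```shell
# Hedged sketch of the quick-start flow summarized above.
set -eu

TASK="pretrain_4m_base"                 # pre-training config name
OUTPUT_DIR="output/$TASK"               # where checkpoints and logs land

# Dependencies (run once in the repo root):
INSTALL="pip3 install -r requirements.txt"

# Distributed pre-training launch (requires Apex, the pre-trained
# encoders, and COCO/VG/SBU/CC3M data laid out per the repo's docs):
LAUNCH="python3 run.py --task $TASK --dist 1 --output_dir $OUTPUT_DIR"

echo "$INSTALL"
echo "$LAUNCH"
```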

Highlighted Details

  • Supports image, video, and cross-lingual/domain transfer capabilities.
  • Offers checkpoints for the 4M- and 16M-image pre-training settings (the figures refer to dataset size, not model parameters).
  • Provides fine-tuning scripts for tasks like retrieval, VQA, NLVR2, and captioning.
  • Includes evaluation scripts for the VLUE benchmark.
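Fine-tuning for the tasks listed above goes through the same run.py entry point as pre-training. A hedged sketch, where the task name "vqa" and the --checkpoint flag are assumptions about the script's interface rather than verified values:

```shell
# Hypothetical fine-tuning launch; TASK, CKPT, and the --checkpoint
# flag are assumed for illustration, not confirmed from the repository.
set -eu

TASK="vqa"                                  # assumed task name
CKPT="path/to/pretrained_checkpoint.th"     # placeholder checkpoint path
FINETUNE="python3 run.py --task $TASK --dist 1 --output_dir output/$TASK --checkpoint $CKPT"

echo "$FINETUNE"
```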

Maintenance & Community

The project was released by ByteDance AI Lab. Contact is available via GitHub issues.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or closed-source integration.

Limitations & Caveats

The README does not specify a license, which may pose a barrier to commercial adoption. Users are responsible for preparing their own pre-training data as the project cannot redistribute it.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star gained in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (author of SWE-Gym; MTS at xAI), Shizhe Diao (author of LMFlow; research scientist at NVIDIA), and 1 more.

METER by zdou0830

373 stars
Multimodal framework for vision-and-language transformer research
Created 3 years ago · Updated 2 years ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

4k stars
Open-source framework for training large multimodal models
Created 2 years ago · Updated 1 year ago