Vision-language model for multi-grained alignment (ICML 2022 paper)
X-VLM is a multi-grained vision-language pre-training framework designed to align text with visual concepts across images and videos. It targets researchers and practitioners in computer vision and natural language processing, offering a unified approach for various vision-language tasks.
How It Works
X-VLM aligns text with visual concepts at multiple granularities, from individual objects and regions up to the whole image, rather than relying on image-level alignment alone. It supports interchangeable backbones for both the vision encoder (DeiT, CLIP-ViT, Swin Transformer) and the text encoder (BERT, RoBERTa), so users can swap in state-of-the-art models. The framework supports mixed-precision distributed training via Apex O1/O2 and can read from and write to HDFS.
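As a rough illustration of the idea, and not the project's actual code, the sketch below pools vision-backbone patch features inside a bounding box to form a region-level "visual concept" embedding and aligns it with a paired text embedding via a symmetric contrastive loss; all names, shapes, and the pooling scheme are simplifying assumptions.

# Minimal conceptual sketch of multi-grained image-text contrastive alignment.
# This is NOT the X-VLM implementation; names and shapes are illustrative only.
import torch
import torch.nn.functional as F

def pool_region(patch_feats, box, grid=14):
    """Mean-pool ViT patch features that fall inside a normalized box (x1, y1, x2, y2)."""
    # patch_feats: (grid*grid, dim) patch features from a vision backbone
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, grid), torch.linspace(0, 1, grid), indexing="ij"
    )
    x1, y1, x2, y2 = box
    mask = (xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)
    return patch_feats[mask.flatten()].mean(dim=0)

def contrastive_loss(vis_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss between matched visual-concept and text embeddings."""
    vis_emb = F.normalize(vis_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = vis_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: four visual concepts at mixed granularities (whole image or a region),
# each paired with the text embedding of its description.
batch, grid, dim = 4, 14, 768
patch_feats = torch.randn(batch, grid * grid, dim)   # per-image patch features
boxes = torch.tensor([[0.0, 0.0, 1.0, 1.0],          # whole image
                      [0.1, 0.1, 0.5, 0.6],          # object-level region
                      [0.0, 0.4, 0.9, 1.0],          # larger region
                      [0.3, 0.2, 0.8, 0.7]])
vis_emb = torch.stack([pool_region(p, b, grid) for p, b in zip(patch_feats, boxes)])
txt_emb = torch.randn(batch, dim)                    # embeddings of the paired texts
print(contrastive_loss(vis_emb, txt_emb))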
Quick Start & Requirements
Install dependencies with:
pip3 install -r requirements.txt
Download the pre-trained encoder checkpoints (clip-vit-base, swin-transformer-base, bert-base-uncased) and prepare the pre-training datasets (COCO, VG, SBU, CC3M). Then launch pre-training with:
python3 run.py --task "pretrain_4m_base" --dist "1" --output_dir "output/pretrain_4m_base"
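Downstream fine-tuning follows the same run.py pattern; the task name and checkpoint path below are illustrative placeholders and should be checked against the tasks and flags actually defined in run.py and the configs directory:
python3 run.py --task "vqa" --dist "1" --output_dir "output/vqa" --checkpoint "path/to/pretrained_checkpoint.th"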
Highlighted Details
Maintenance & Community
The project was released by ByteDance AI-LAB. Contact is available via GitHub issues.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or closed-source integration.
Limitations & Caveats
The README does not specify a license, which may pose a barrier to commercial adoption. Users are responsible for preparing their own pre-training data as the project cannot redistribute it.