Vision-language model for multi-grained alignment (ICML 2022 paper)
X-VLM is a multi-grained vision-language pre-training framework designed to align text with visual concepts across images and videos. It targets researchers and practitioners in computer vision and natural language processing, offering a unified approach for various vision-language tasks.
How It Works
X-VLM aligns text with visual concepts at multiple granularities, from individual objects and regions up to the whole image, rather than relying on image-level alignment alone. It supports interchangeable backbones for both the vision encoder (DeiT, CLIP-ViT, Swin Transformer) and the text encoder (BERT, RoBERTa), so users can swap in state-of-the-art models. The framework supports mixed-precision distributed training via Apex O1/O2 and can read from and write to HDFS.
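As a rough illustration of the idea, and not the project's actual code, the sketch below pools vision-backbone patch features inside a bounding box to form a region-level "visual concept" embedding and aligns it with a paired text embedding via a symmetric contrastive loss; all names, shapes, and the pooling scheme are simplifying assumptions.

# Minimal conceptual sketch of multi-grained image-text contrastive alignment.
# This is NOT the X-VLM implementation; names and shapes are illustrative only.
import torch
import torch.nn.functional as F

def pool_region(patch_feats, box, grid=14):
    """Mean-pool ViT patch features that fall inside a normalized box (x1, y1, x2, y2)."""
    # patch_feats: (grid*grid, dim) patch features from a vision backbone
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, grid), torch.linspace(0, 1, grid), indexing="ij"
    )
    x1, y1, x2, y2 = box
    mask = (xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)
    return patch_feats[mask.flatten()].mean(dim=0)

def contrastive_loss(vis_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss between matched visual-concept and text embeddings."""
    vis_emb = F.normalize(vis_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = vis_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: four visual concepts at mixed granularities (whole image or a region),
# each paired with the text embedding of its description.
batch, grid, dim = 4, 14, 768
patch_feats = torch.randn(batch, grid * grid, dim)   # per-image patch features
boxes = torch.tensor([[0.0, 0.0, 1.0, 1.0],          # whole image
                      [0.1, 0.1, 0.5, 0.6],          # object-level region
                      [0.0, 0.4, 0.9, 1.0],          # larger region
                      [0.3, 0.2, 0.8, 0.7]])
vis_emb = torch.stack([pool_region(p, b, grid) for p, b in zip(patch_feats, boxes)])
txt_emb = torch.randn(batch, dim)                    # embeddings of the paired texts
print(contrastive_loss(vis_emb, txt_emb))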
Quick Start & Requirements
Install dependencies with:
pip3 install -r requirements.txt
Download the pre-trained encoder checkpoints (clip-vit-base, swin-transformer-base, bert-base-uncased) and prepare the pre-training datasets (COCO, VG, SBU, CC3M). Then launch pre-training with:
python3 run.py --task "pretrain_4m_base" --dist "1" --output_dir "output/pretrain_4m_base"
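Downstream fine-tuning follows the same run.py pattern; the task name and checkpoint path below are illustrative placeholders and should be checked against the tasks and flags actually defined in run.py and the configs directory:
python3 run.py --task "vqa" --dist "1" --output_dir "output/vqa" --checkpoint "path/to/pretrained_checkpoint.th"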
Highlighted Details
Maintenance & Community
The project was released by ByteDance AI-LAB. Contact is available via GitHub issues.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or closed-source integration.
Limitations & Caveats
The README does not specify a license, which may pose a barrier to commercial adoption. Users are responsible for preparing their own pre-training data as the project cannot redistribute it.