Visual-linguistic representation for multimodal tasks (ICLR 2020 paper)
Top 47.6% on sourcepulse
VL-BERT provides an official implementation for pre-training generic visual-linguistic representations, targeting researchers and practitioners in multimodal AI. It enables fine-tuning for tasks like Visual Commonsense Reasoning, Visual Question Answering, and Referring Expression Comprehension, offering a robust foundation for multimodal understanding.
How It Works
VL-BERT employs a Transformer-based architecture pre-trained on large-scale caption and text-only corpora. This approach lets it learn rich, joint representations of visual and linguistic information, making it adaptable to various downstream tasks with minimal task-specific modifications. The codebase supports distributed multi-GPU training and FP16 mixed-precision training.
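For intuition only, the sketch below illustrates the general idea behind such joint encoding: text token embeddings and projected image-region features are concatenated into a single sequence (with segment embeddings marking the modality) and passed through one shared Transformer encoder. All class names, dimensions, and hyperparameters here are hypothetical and do not reflect the repository's actual model code.

```python
import torch
import torch.nn as nn

class ToyVisualLinguisticEncoder(nn.Module):
    """Minimal sketch: fuse word embeddings and region features in one Transformer."""
    def __init__(self, vocab_size=30522, hidden=256, visual_dim=2048, layers=2, heads=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)   # map RoI features to hidden size
        self.segment_emb = nn.Embedding(2, hidden)         # 0 = text token, 1 = visual token
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, token_ids, region_feats):
        # token_ids: (batch, T); region_feats: (batch, R, visual_dim)
        text = self.word_emb(token_ids)                    # (batch, T, hidden)
        vis = self.visual_proj(region_feats)               # (batch, R, hidden)
        seq = torch.cat([text, vis], dim=1)                # one joint sequence
        seg = torch.cat([
            torch.zeros(token_ids.shape, dtype=torch.long),
            torch.ones(region_feats.shape[:2], dtype=torch.long),
        ], dim=1)
        seq = seq + self.segment_emb(seg)
        # TransformerEncoder expects (seq_len, batch, hidden) by default
        return self.encoder(seq.transpose(0, 1)).transpose(0, 1)

# Example: 2 captions of 8 tokens, each paired with 4 image regions (2048-d features)
model = ToyVisualLinguisticEncoder()
out = model(torch.randint(0, 30522, (2, 8)), torch.randn(2, 4, 2048))
print(out.shape)  # torch.Size([2, 12, 256])
```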
Quick Start & Requirements
Create a conda environment with Python 3.6 (conda create -n vl-bert python=3.6 pip, then conda activate vl-bert), install PyTorch 1.1.0 built for CUDA 9.0 (conda install pytorch=1.1.0 cudatoolkit=9.0 -c pytorch), and optionally install Apex for speed-up and FP16 training. Then run pip install Cython followed by pip install -r requirements.txt, and compile with ./scripts/init.sh. See PREPARE_DATA.md and PREPARE_PRETRAINED_MODELS.md for preparing the datasets and pretrained models.
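After setup, a short sanity check (my own suggestion, not part of the repository's instructions) can confirm that the pinned PyTorch/CUDA versions and the optional Apex install are visible before compiling or training:

```python
import torch

# Versions pinned by the quick-start instructions above
print("PyTorch:", torch.__version__)        # expected 1.1.0
print("CUDA toolkit:", torch.version.cuda)  # expected 9.0
print("GPU available:", torch.cuda.is_available())

try:
    from apex import amp  # optional: only present if Apex was installed for FP16/speed-up
    print("Apex AMP available")
except ImportError:
    print("Apex not installed (optional)")
```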
Maintenance & Community
The project was presented at ICLR 2020. No specific community channels or active maintenance signals are evident in the README.
Licensing & Compatibility
The repository does not explicitly state a license. The code is presented as an official implementation for a research paper, implying research-focused usage. Commercial use or linking with closed-source projects may require clarification.
Limitations & Caveats
The setup requires specific older versions of CUDA (9.0) and Python (3.6), which may pose compatibility challenges on modern hardware and software stacks. Deadlock issues are noted for distributed training on RefCOCO+; non-distributed training is suggested as a workaround.
Last updated 2 years ago; the repository is marked inactive.