VL-BERT by jackroos

Visual-linguistic representation for multimodal tasks (ICLR 2020 paper)

Created 5 years ago
744 stars

Top 46.6% on SourcePulse

View on GitHub
Project Summary

VL-BERT provides an official implementation for pre-training generic visual-linguistic representations, targeting researchers and practitioners in multimodal AI. It enables fine-tuning for tasks like Visual Commonsense Reasoning, Visual Question Answering, and Referring Expression Comprehension, offering a robust foundation for multimodal understanding.

How It Works

VL-BERT employs a Transformer-based architecture, pre-trained on large-scale caption and text-only corpora. This approach allows it to learn rich, joint representations of visual and linguistic information, making it adaptable to various downstream tasks with minimal task-specific modifications. The codebase supports distributed training and FP16 mixed-precision training for efficiency and scalability.
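
As a rough illustration of this joint input format, the sketch below (not VL-BERT's actual code; the vocabulary size, feature dimension, and layer count are assumed, and a reasonably recent PyTorch is required) projects text token embeddings and detected-region features into a shared space, concatenates them into one sequence, and runs a standard Transformer encoder over it.

```python
# Minimal sketch of a joint visual-linguistic sequence; all sizes are illustrative.
import torch
from torch import nn

HIDDEN = 768
text_embed = nn.Embedding(30522, HIDDEN)        # assumed BERT-style vocabulary size
visual_proj = nn.Linear(2048, HIDDEN)           # assumed dimension of detector region features
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=12, batch_first=True),
    num_layers=2,
)

token_ids = torch.randint(0, 30522, (1, 16))    # dummy caption tokens
region_feats = torch.randn(1, 10, 2048)         # dummy detected-region features

# Concatenate linguistic and visual elements into one sequence and encode jointly.
joint = torch.cat([text_embed(token_ids), visual_proj(region_feats)], dim=1)
context = encoder(joint)                        # shape: (1, 16 + 10, 768)
print(context.shape)
```

In the full model, each element additionally carries segment and position embeddings and the visual features come from a pre-trained detector; the sketch only shows the single shared sequence.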

Quick Start & Requirements

  • Install: Create a conda environment (conda create -n vl-bert python=3.6 pip, conda activate vl-bert), install PyTorch 1.1.0 with CUDA 9.0 (conda install pytorch=1.1.0 cudatoolkit=9.0 -c pytorch), optionally install Apex for speed-up/FP16, then pip install Cython and pip install -r requirements.txt. Compile with ./scripts/init.sh.
  • Prerequisites: Ubuntu 16.04, CUDA 9.0, GCC 4.9.4, Python 3.6.x, PyTorch 1.0.0 or 1.1.0 (a quick version check is sketched after this list).
  • Data/Models: Refer to PREPARE_DATA.md and PREPARE_PRETRAINED_MODELS.md.
  • Links: Official Implementation
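
Before compiling, a quick environment check (an assumed helper, not something the repo ships) can confirm the pinned PyTorch and CUDA versions from the prerequisites above:

```python
# Assumed sanity check, not part of the VL-BERT repo: verify the pinned versions
# before running ./scripts/init.sh.
import torch

print("PyTorch:", torch.__version__)        # README expects 1.0.0 or 1.1.0
print("CUDA build:", torch.version.cuda)    # README expects 9.0
print("GPU available:", torch.cuda.is_available())

assert torch.__version__.startswith(("1.0", "1.1")), "install PyTorch 1.0.x or 1.1.x"
```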

Highlighted Details

  • Supports distributed training (single- and multi-machine), FP16 mixed-precision, gradient accumulation, and TensorboardX monitoring (a minimal FP16 + gradient-accumulation sketch follows this list).
  • Includes scripts for training and evaluation across VCR, VQA, and RefCOCO tasks.
  • The codebase builds on components from libraries such as transformers, mmdetection, and bottom-up-attention.
  • Visualization code is available.
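
The snippet below is a minimal sketch of the FP16 and gradient-accumulation pattern listed above, assuming NVIDIA Apex and a CUDA GPU are available; the model, data, and accumulation factor are placeholders rather than the repo's actual training loop.

```python
# Illustrative FP16 (Apex) + gradient accumulation loop; not VL-BERT's training code.
import torch
from torch import nn
from apex import amp  # optional dependency, as in the install instructions

model = nn.Linear(2048, 768).cuda()              # placeholder for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")  # mixed precision

accum_steps = 4                                   # assumed accumulation factor
optimizer.zero_grad()
for step in range(16):
    features = torch.randn(8, 2048).cuda()        # dummy batch of region features
    loss = model(features).mean() / accum_steps   # scale loss by accumulation factor
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()                    # loss-scaled backward for FP16 stability
    if (step + 1) % accum_steps == 0:
        optimizer.step()                          # update once every accum_steps batches
        optimizer.zero_grad()
```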

Maintenance & Community

The project was presented at ICLR 2020. No specific community channels or active maintenance signals are evident in the README.

Licensing & Compatibility

The repository does not explicitly state a license. The code is presented as an official implementation for a research paper, implying research-focused usage. Commercial use or linking with closed-source projects may require clarification.

Limitations & Caveats

The setup requires specific older versions of CUDA (9.0) and Python (3.6), which may pose compatibility challenges with modern hardware and software stacks. Deadlock issues are noted with distributed training for RefCOCO+, suggesting non-distributed training as a workaround.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu
0% · 482 stars
Multimodal model for grounding language models to images
Created 2 years ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 10 more.

LAVIS by salesforce
0.2% · 11k stars
Library for language-vision AI research
Created 3 years ago · Updated 10 months ago