VL-BERT by jackroos

Visual-linguistic representation for multimodal tasks (ICLR 2020 paper)

Created 5 years ago
744 stars

Top 46.6% on SourcePulse

View on GitHub
Project Summary

VL-BERT provides an official implementation for pre-training generic visual-linguistic representations, targeting researchers and practitioners in multimodal AI. It enables fine-tuning for tasks like Visual Commonsense Reasoning, Visual Question Answering, and Referring Expression Comprehension, offering a robust foundation for multimodal understanding.

How It Works

VL-BERT employs a Transformer-based architecture, pre-trained on large-scale caption and text-only corpora. This approach allows it to learn rich, joint representations of visual and linguistic information, making it adaptable to various downstream tasks with minimal task-specific modifications. The codebase supports distributed training and FP16 mixed-precision training for efficiency and scalability.
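
As a rough illustration of this joint input format, the sketch below (not VL-BERT's actual code; the vocabulary size, feature dimension, and layer count are assumed, and a reasonably recent PyTorch is required) projects text token embeddings and detected-region features into a shared space, concatenates them into one sequence, and runs a standard Transformer encoder over it.

```python
# Minimal sketch of a joint visual-linguistic sequence; all sizes are illustrative.
import torch
from torch import nn

HIDDEN = 768
text_embed = nn.Embedding(30522, HIDDEN)        # assumed BERT-style vocabulary size
visual_proj = nn.Linear(2048, HIDDEN)           # assumed dimension of detector region features
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=12, batch_first=True),
    num_layers=2,
)

token_ids = torch.randint(0, 30522, (1, 16))    # dummy caption tokens
region_feats = torch.randn(1, 10, 2048)         # dummy detected-region features

# Concatenate linguistic and visual elements into one sequence and encode jointly.
joint = torch.cat([text_embed(token_ids), visual_proj(region_feats)], dim=1)
context = encoder(joint)                        # shape: (1, 16 + 10, 768)
print(context.shape)
```

In the full model, each element additionally carries segment and position embeddings and the visual features come from a pre-trained detector; the sketch only shows the single shared sequence.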

Quick Start & Requirements

  • Install: Create a conda environment (conda create -n vl-bert python=3.6 pip, conda activate vl-bert), install PyTorch 1.1.0 with CUDA 9.0 (conda install pytorch=1.1.0 cudatoolkit=9.0 -c pytorch), optionally install Apex for speed-up/FP16, then pip install Cython and pip install -r requirements.txt. Compile with ./scripts/init.sh.
  • Prerequisites: Ubuntu 16.04, CUDA 9.0, GCC 4.9.4, Python 3.6.x, PyTorch 1.0.0 or 1.1.0 (a quick version check is sketched after this list).
  • Data/Models: Refer to PREPARE_DATA.md and PREPARE_PRETRAINED_MODELS.md.
  • Links: Official Implementation
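
Before compiling, a quick environment check (an assumed helper, not something the repo ships) can confirm the pinned PyTorch and CUDA versions from the prerequisites above:

```python
# Assumed sanity check, not part of the VL-BERT repo: verify the pinned versions
# before running ./scripts/init.sh.
import torch

print("PyTorch:", torch.__version__)        # README expects 1.0.0 or 1.1.0
print("CUDA build:", torch.version.cuda)    # README expects 9.0
print("GPU available:", torch.cuda.is_available())

assert torch.__version__.startswith(("1.0", "1.1")), "install PyTorch 1.0.x or 1.1.x"
```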

Highlighted Details

  • Supports distributed training (single- and multi-machine), FP16 mixed-precision, gradient accumulation, and TensorboardX monitoring (a minimal FP16 + gradient-accumulation sketch follows this list).
  • Includes scripts for training and evaluation across VCR, VQA, and RefCOCO tasks.
  • The codebase builds on components from libraries such as transformers, mmdetection, and bottom-up-attention.
  • Visualization code is available.
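
The snippet below is a minimal sketch of the FP16 and gradient-accumulation pattern listed above, assuming NVIDIA Apex and a CUDA GPU are available; the model, data, and accumulation factor are placeholders rather than the repo's actual training loop.

```python
# Illustrative FP16 (Apex) + gradient accumulation loop; not VL-BERT's training code.
import torch
from torch import nn
from apex import amp  # optional dependency, as in the install instructions

model = nn.Linear(2048, 768).cuda()              # placeholder for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")  # mixed precision

accum_steps = 4                                   # assumed accumulation factor
optimizer.zero_grad()
for step in range(16):
    features = torch.randn(8, 2048).cuda()        # dummy batch of region features
    loss = model(features).mean() / accum_steps   # scale loss by accumulation factor
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()                    # loss-scaled backward for FP16 stability
    if (step + 1) % accum_steps == 0:
        optimizer.step()                          # update once every accum_steps batches
        optimizer.zero_grad()
```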

Maintenance & Community

The project was presented at ICLR 2020. No specific community channels or active maintenance signals are evident in the README.

Licensing & Compatibility

The repository does not explicitly state a license. The code is presented as an official implementation for a research paper, implying research-focused usage. Commercial use or linking with closed-source projects may require clarification.

Limitations & Caveats

The setup requires specific older versions of CUDA (9.0) and Python (3.6), which may pose compatibility challenges with modern hardware and software stacks. Deadlock issues are noted with distributed training for RefCOCO+, suggesting non-distributed training as a workaround.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu
0% · 482 stars
Multimodal model for grounding language models to images
Created 2 years ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 10 more.

LAVIS by salesforce
0.2% · 11k stars
Library for language-vision AI research
Created 3 years ago · Updated 10 months ago