Vision-language representation learning research paper & models
This repository provides code and pre-trained models for multi-task vision and language representation learning, specifically addressing the "12-in-1" approach. It's designed for researchers and practitioners in the vision-language domain looking to leverage a unified model for diverse tasks.
How It Works
The project implements the ViLBERT architecture, which jointly learns representations from visual and textual modalities. It utilizes a multi-task learning framework, pre-training on large datasets like Conceptual Captions and then fine-tuning on a suite of 12 downstream vision-language tasks. This approach aims to create a more robust and generalizable visiolinguistic model.
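The multi-task setup can be pictured as a single shared vision-language trunk with one lightweight head per downstream task, trained by sampling a task at each step. The sketch below uses hypothetical names and simplified shapes (it is not the repository's actual classes or API) to illustrate that idea.

```python
# Minimal sketch of the multi-task idea (hypothetical, simplified):
# one shared vision-language encoder, one small head per task, and a
# training loop that samples a task per step so all tasks share parameters.
import random
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Stand-in for the two-stream ViLBERT trunk (placeholder, not the real model)."""
    def __init__(self, visual_dim=2048, text_dim=768, hidden_dim=768):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)

    def forward(self, image_feats, text_feats):
        # Real ViLBERT fuses modalities with co-attentional transformer layers;
        # summing two projections is only a placeholder for that fusion.
        return self.visual_proj(image_feats) + self.text_proj(text_feats)

encoder = SharedEncoder()
task_heads = nn.ModuleDict({          # one head per downstream task (subset shown)
    "vqa": nn.Linear(768, 3129),      # answer classification
    "retrieval": nn.Linear(768, 1),   # image-text matching score
    "refer": nn.Linear(768, 4),       # referring-expression box regression
})
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(task_heads.parameters()), lr=4e-5)

for step in range(3):                         # toy loop with random tensors
    task = random.choice(list(task_heads))    # sampled task schedule
    image_feats = torch.randn(8, 2048)
    text_feats = torch.randn(8, 768)
    fused = encoder(image_feats, text_feats)
    logits = task_heads[task](fused)
    loss = logits.pow(2).mean()               # placeholder for the task-specific loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: task={task} loss={loss.item():.4f}")
```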
Quick Start & Requirements
- Clone the repository with submodules (git clone --recursive)
- Create and activate a conda environment (conda create -n vilbert-mt python=3.6)
- Install the Python requirements (pip install -r requirements.txt)
- Install PyTorch built against CUDA 10.0 (conda install pytorch torchvision cudatoolkit=10.0 -c pytorch)
- Install NVIDIA Apex
- Install the codebase in development mode (python setup.py develop)
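After installation, a quick sanity check with standard PyTorch calls (nothing specific to this repository) confirms the environment matches the pinned CUDA 10.0 setup.

```python
import torch

# Environment check: the pinned setup expects a PyTorch build against CUDA 10.0.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)  # should report 10.0 for this setup
```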
Highlighted Details
Maintenance & Community
The project originates from Facebook AI Research (FAIR). Specific community channels or active maintenance status are not detailed in the README.
Licensing & Compatibility
Limitations & Caveats
The provided setup pins Python 3.6 and CUDA 10.0, both of which are now outdated. The README does not detail specific hardware requirements beyond CUDA, nor does it offer explicit guidance on migrating to newer PyTorch or CUDA versions.