Vision-language representation learning research paper & models
This repository provides code and pre-trained models for multi-task vision and language representation learning, specifically addressing the "12-in-1" approach. It's designed for researchers and practitioners in the vision-language domain looking to leverage a unified model for diverse tasks.
How It Works
The project implements the ViLBERT architecture, which jointly learns representations from visual and textual modalities. It utilizes a multi-task learning framework, pre-training on large datasets like Conceptual Captions and then fine-tuning on a suite of 12 downstream vision-language tasks. This approach aims to create a more robust and generalizable visiolinguistic model.
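The multi-task setup can be pictured as a single shared vision-language trunk with one lightweight head per downstream task, trained by sampling a task at each step. The sketch below uses hypothetical names and simplified shapes (it is not the repository's actual classes or API) to illustrate that idea.

```python
# Minimal sketch of the multi-task idea (hypothetical, simplified):
# one shared vision-language encoder, one small head per task, and a
# training loop that samples a task per step so all tasks share parameters.
import random
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Stand-in for the two-stream ViLBERT trunk (placeholder, not the real model)."""
    def __init__(self, visual_dim=2048, text_dim=768, hidden_dim=768):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)

    def forward(self, image_feats, text_feats):
        # Real ViLBERT fuses modalities with co-attentional transformer layers;
        # summing two projections is only a placeholder for that fusion.
        return self.visual_proj(image_feats) + self.text_proj(text_feats)

encoder = SharedEncoder()
task_heads = nn.ModuleDict({          # one head per downstream task (subset shown)
    "vqa": nn.Linear(768, 3129),      # answer classification
    "retrieval": nn.Linear(768, 1),   # image-text matching score
    "refer": nn.Linear(768, 4),       # referring-expression box regression
})
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(task_heads.parameters()), lr=4e-5)

for step in range(3):                         # toy loop with random tensors
    task = random.choice(list(task_heads))    # sampled task schedule
    image_feats = torch.randn(8, 2048)
    text_feats = torch.randn(8, 768)
    fused = encoder(image_feats, text_feats)
    logits = task_heads[task](fused)
    loss = logits.pow(2).mean()               # placeholder for the task-specific loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: task={task} loss={loss.item():.4f}")
```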
Quick Start & Requirements
- Clone the repository with submodules (git clone --recursive)
- Create and activate a conda environment (conda create -n vilbert-mt python=3.6)
- Install the Python requirements (pip install -r requirements.txt)
- Install PyTorch built against CUDA 10.0 (conda install pytorch torchvision cudatoolkit=10.0 -c pytorch)
- Install NVIDIA Apex
- Install the codebase in development mode (python setup.py develop)
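After installation, a quick sanity check with standard PyTorch calls (nothing specific to this repository) confirms the environment matches the pinned CUDA 10.0 setup.

```python
import torch

# Environment check: the pinned setup expects a PyTorch build against CUDA 10.0.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)  # should report 10.0 for this setup
```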
Highlighted Details
Maintenance & Community
The project originates from Facebook AI Research (FAIR). Specific community channels or active maintenance status are not detailed in the README.
Licensing & Compatibility
Limitations & Caveats
The provided setup pins Python 3.6 and CUDA 10.0, both of which are now outdated. The README does not detail specific hardware requirements beyond CUDA, nor does it offer explicit guidance on migrating to newer PyTorch or CUDA versions.