vilbert-multi-task by facebookresearch

Vision-language representation learning research paper & models

created 5 years ago
815 stars

Top 44.4% on sourcepulse

Project Summary

This repository provides code and pre-trained models for multi-task vision and language representation learning, specifically addressing the "12-in-1" approach. It's designed for researchers and practitioners in the vision-language domain looking to leverage a unified model for diverse tasks.

How It Works

The project implements the ViLBERT architecture, which jointly learns representations from visual and textual modalities. It utilizes a multi-task learning framework, pre-training on large datasets like Conceptual Captions and then fine-tuning on a suite of 12 downstream vision-language tasks. This approach aims to create a more robust and generalizable visiolinguistic model.
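The multi-task scheme described above can be sketched in a toy example: each optimization step samples one of the tasks (roughly in proportion to its dataset size) and routes a shared trunk's output through that task's head. All names below are illustrative assumptions, not the repository's actual API.

```python
import random

# Illustrative stand-in for the shared ViLBERT trunk (joint vision-language encoder).
def shared_encoder(batch):
    return f"features({batch})"

# One lightweight head per downstream task; the real model has 12 such heads.
task_heads = {
    "vqa": lambda feats: f"vqa_logits({feats})",
    "retrieval": lambda feats: f"retrieval_scores({feats})",
    "grounding": lambda feats: f"grounding_boxes({feats})",
}

# Dataset sizes drive how often each task is sampled per training step.
dataset_sizes = {"vqa": 600, "retrieval": 300, "grounding": 100}

def sample_task(rng):
    """Pick a task for this step, weighted by its dataset size."""
    tasks, sizes = zip(*dataset_sizes.items())
    return rng.choices(tasks, weights=sizes, k=1)[0]

rng = random.Random(0)
for step in range(3):
    task = sample_task(rng)
    feats = shared_encoder(f"batch_{step}")
    out = task_heads[task](feats)  # only the sampled task's head is updated this step
```

The design point is that gradients for every task flow through the same encoder, which is what makes the learned representation transferable across the 12 tasks.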

Quick Start & Requirements

  • Install: Clone the repo with submodules (git clone --recursive), create and activate a conda environment (conda create -n vilbert-mt python=3.6), install dependencies (pip install -r requirements.txt), install PyTorch with CUDA 10.0 (conda install pytorch torchvision cudatoolkit=10.0 -c pytorch), install NVIDIA Apex, and then install the codebase in development mode (python setup.py develop).
  • Prerequisites: Python 3.6, PyTorch with CUDA 10.0, and potentially large datasets for pre-training.
  • Setup Time: Requires environment setup, dependency installation, and potentially significant time for pre-training or downloading pre-trained models.
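The install steps above can be collected into one shell sketch. The repository URL is assumed from the project name and owner, and the Apex build flags follow NVIDIA's Apex README; check the project's own README for the exact commands.

```shell
# Clone with submodules (URL assumed from the project name and owner)
git clone --recursive https://github.com/facebookresearch/vilbert-multi-task.git
cd vilbert-multi-task

# Create and activate the conda environment
conda create -n vilbert-mt python=3.6
conda activate vilbert-mt

# Install Python dependencies, then PyTorch with CUDA 10.0
pip install -r requirements.txt
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

# Install NVIDIA Apex for mixed-precision training (flags per the Apex README)
git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..

# Install the codebase in development mode
python setup.py develop
```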

Highlighted Details

  • Supports 12 vision-language tasks within a single model.
  • Offers pre-trained models for faster adoption.
  • Implements the ViLBERT architecture for joint vision-language understanding.
  • Provides scripts for both pre-training and multi-task fine-tuning.

Maintenance & Community

The project originates from Facebook AI Research (FAIR). Specific community channels or active maintenance status are not detailed in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: The MIT license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The provided setup pins Python 3.6 and CUDA 10.0, both of which are now end-of-life. The README does not detail hardware requirements beyond a CUDA-capable GPU, nor does it offer guidance on migrating to newer PyTorch or CUDA versions.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 90 days
