vilbert-multi-task by facebookresearch

Vision-language representation learning research paper & models

Created 5 years ago
818 stars

Top 43.4% on SourcePulse

View on GitHub
Project Summary

This repository provides code and pre-trained models for multi-task vision and language representation learning, specifically addressing the "12-in-1" approach. It's designed for researchers and practitioners in the vision-language domain looking to leverage a unified model for diverse tasks.

How It Works

The project implements the ViLBERT architecture, which jointly learns representations from visual and textual modalities. It utilizes a multi-task learning framework, pre-training on large datasets like Conceptual Captions and then fine-tuning on a suite of 12 downstream vision-language tasks. This approach aims to create a more robust and generalizable visiolinguistic model.

Quick Start & Requirements

  • Install: Clone the repo with submodules (git clone --recursive), create and activate a conda environment (conda create -n vilbert-mt python=3.6), install the requirements (pip install -r requirements.txt), install PyTorch built against CUDA 10.0 (conda install pytorch torchvision cudatoolkit=10.0 -c pytorch), install Apex, and then install the codebase in development mode (python setup.py develop); the full sequence is collected in the snippet after this list.
  • Prerequisites: Python 3.6, PyTorch with CUDA 10.0, and potentially large datasets for pre-training.
  • Setup Time: Requires environment setup, dependency installation, and potentially significant time for pre-training or downloading pre-trained models.
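
For convenience, the install steps above are collected here as a single shell sequence. The repository URL is inferred from the project name and owner, and the Apex step is left as a comment since its exact install command is not reproduced in this summary:

```bash
# Clone the repository together with its submodules
git clone --recursive https://github.com/facebookresearch/vilbert-multi-task.git
cd vilbert-multi-task

# Create and activate the conda environment (Python 3.6, as specified above)
conda create -n vilbert-mt python=3.6
conda activate vilbert-mt

# Install the Python requirements and PyTorch built against CUDA 10.0
pip install -r requirements.txt
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

# Install NVIDIA Apex per its own instructions (command not shown here),
# then install this codebase in development mode
python setup.py develop
```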

Highlighted Details

  • Supports 12 vision-language tasks within a single model.
  • Offers pre-trained models for faster adoption.
  • Implements the ViLBERT architecture for joint vision-language understanding.
  • Provides scripts for both pre-training and multi-task fine-tuning (an illustrative invocation is sketched below).
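
The workflow described above (pre-train, then fine-tune across the 12 tasks) might be driven by a command along these lines. This is only an illustrative sketch: the script name, flag names, and placeholder paths below are assumptions rather than the repository's documented interface, so consult the repo's README for the exact invocations:

```bash
# Hypothetical multi-task fine-tuning invocation; the script name, flags, and
# placeholders are assumptions -- see the repository README for the real commands.
python train_tasks.py \
  --bert_model bert-base-uncased \
  --from_pretrained <path to a pre-trained ViLBERT checkpoint> \
  --config_file <model config JSON> \
  --tasks <dash-separated IDs of the 12 downstream tasks> \
  --save_name multi_task_model
```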

Maintenance & Community

The project originates from Facebook AI Research (FAIR). Specific community channels or active maintenance status are not detailed in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: The MIT license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

The provided setup pins Python 3.6 and CUDA 10.0, both of which are now dated (Python 3.6 reached end-of-life in 2021). The README does not detail hardware requirements beyond CUDA, nor does it offer guidance on migrating to newer PyTorch or CUDA versions.

Health Check

  • Last Commit: 3 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

Top 0.1% · 4k stars
Open-source framework for training large multimodal models
Created 2 years ago · Updated 1 year ago
Starred by Forrest Iandola (Author of SqueezeNet; Research Scientist at Meta), Chris Van Pelt (Cofounder of Weights & Biases), and 2 more.

mt-dnn by namisan

Top 0% · 2k stars
PyTorch package for multi-task deep neural networks research
Created 6 years ago · Updated 1 year ago