GiT by Haiyang-W

Vision Transformer research paper for generalist models via language interface

created 1 year ago
350 stars

Top 80.6% on sourcepulse

Project Summary

GiT is the official implementation of a generalist vision transformer that unifies diverse vision tasks with a plain transformer architecture, much like an LLM. It targets researchers and practitioners who want to reduce task-specific engineering in computer vision by using a single, adaptable model for object detection, instance segmentation, semantic segmentation, image captioning, and visual grounding. The model shows strong multi-task and zero-/few-shot performance without negative transfer.

How It Works

GiT employs a minimalist architecture consisting solely of a vanilla Vision Transformer (ViT) and a universal language interface. This approach avoids modality-specific encoders and task-specific heads, mirroring the success of large language models. By treating diverse vision tasks as sequence-to-sequence problems solvable by a plain transformer, GiT benefits from task synergy during multi-task training, leading to performance improvements across tasks.
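
The idea can be pictured with a minimal, hypothetical sketch (not the official GiT code, which is built on MMDetection): image patches and task-prompt tokens are concatenated into one sequence, and a single plain transformer predicts tokens from a shared vocabulary that are later decoded per task. All names and sizes below are illustrative, and the real model decodes its outputs autoregressively.

```python
# Illustrative sketch only -- names, sizes, and token layout are hypothetical;
# this is not the official GiT implementation (which builds on MMDetection).
import torch
import torch.nn as nn


class GeneralistViT(nn.Module):
    """One plain transformer serving every task through a shared token interface."""

    def __init__(self, vocab_size=32000, dim=768, depth=12, heads=12, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.token_embed = nn.Embedding(vocab_size, dim)   # universal language interface
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)             # predicts output tokens

    def forward(self, image, prompt_tokens):
        # The task is specified only by the prompt tokens (e.g. "detect", "caption"),
        # so no modality-specific encoders or task-specific heads are needed.
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)
        prompt = self.token_embed(prompt_tokens)
        seq = torch.cat([patches, prompt], dim=1)
        return self.head(self.encoder(seq))                # logits over the shared vocabulary


# Every task becomes "image + prompt in, token sequence out"; predicted tokens are
# decoded per task into boxes, masks, captions, or grounded regions.
model = GeneralistViT()
image = torch.randn(1, 3, 224, 224)
prompt = torch.randint(0, 32000, (1, 8))
print(model(image, prompt).shape)   # (1, 204, 32000): 196 patch tokens + 8 prompt tokens
```

Because every task shares the same weights and the same output space, improvements learned on one task can transfer to the others, which is the source of the multi-task synergy described above.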

Quick Start & Requirements

  • Installation: Requires Python 3.8, PyTorch 1.9.1+cu111, and specific versions of openmim, mmengine, mmcv, and transformers. Installation involves cloning the repository, setting up a conda environment, and installing dependencies via pip and mim.
  • Prerequisites: Pretrained text embeddings from Hugging Face are required (a download sketch follows this list). Java is optionally needed for image-caption evaluation, and the LVIS API is required for the LVIS dataset.
  • Resources: Training requires multiple GPUs. Pretrained weights are available for different model sizes (Base, Large, Huge).
  • Links: arXiv, Hugging Face Embeddings
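
For the text-embedding prerequisite, a minimal download sketch using huggingface_hub is shown below. The repository ID is a placeholder, since the exact Hugging Face repository is only linked above, not named here.

```python
# Hypothetical sketch: fetch the pretrained text embeddings with huggingface_hub.
# Replace the placeholder repo_id with the embeddings repository linked by the project.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<project-text-embeddings>",  # placeholder, see the Hugging Face link above
    local_dir="./text_embeddings",        # local directory to store the files
)
print("Embeddings downloaded to", local_dir)
```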

Highlighted Details

  • Achieves state-of-the-art performance on multi-tasking, zero-shot, and few-shot benchmarks.
  • Demonstrates significant performance gains through multi-task training synergy.
  • Scales well with model size and data, showing strong generalizability.
  • Supports a wide range of vision tasks including detection, segmentation, and captioning.

Maintenance & Community

The project's paper was accepted as an oral presentation at ECCV 2024. The codebase is built on MMDetection and uses BLIP for its text embeddings. Planned work includes engineering optimization for speed, joint training with language, and code refactoring.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The authors note that parts of the current code are "dirty" and slated for refactoring. Engineering optimizations for speed are planned but not yet implemented, and joint training with language remains a future goal.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

  • Open-source framework for training large multimodal models
  • 4k stars, top 0.1% on sourcepulse
  • Created 2 years ago, updated 11 months ago