GiT by Haiyang-W

Vision Transformer research paper for generalist models via language interface

created 1 year ago
350 stars

Top 80.6% on sourcepulse

Project Summary

GiT is the official implementation of a generalist vision transformer that unifies diverse vision tasks with a plain transformer architecture, much like an LLM. It targets researchers and practitioners who want to reduce task-specific engineering in computer vision by using a single, adaptable model for object detection, instance segmentation, semantic segmentation, image captioning, and visual grounding. The model shows strong multi-task and zero-/few-shot performance without negative transfer.

How It Works

GiT employs a minimalist architecture consisting solely of a vanilla Vision Transformer (ViT) and a universal language interface. This approach avoids modality-specific encoders and task-specific heads, mirroring the success of large language models. By treating diverse vision tasks as sequence-to-sequence problems solvable by a plain transformer, GiT benefits from task synergy during multi-task training, leading to performance improvements across tasks.
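
The idea can be pictured with a minimal, hypothetical sketch (not the official GiT code, which is built on MMDetection): image patches and task-prompt tokens are concatenated into one sequence, and a single plain transformer predicts tokens from a shared vocabulary that are later decoded per task. All names and sizes below are illustrative, and the real model decodes its outputs autoregressively.

```python
# Illustrative sketch only -- names, sizes, and token layout are hypothetical;
# this is not the official GiT implementation (which builds on MMDetection).
import torch
import torch.nn as nn


class GeneralistViT(nn.Module):
    """One plain transformer serving every task through a shared token interface."""

    def __init__(self, vocab_size=32000, dim=768, depth=12, heads=12, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.token_embed = nn.Embedding(vocab_size, dim)   # universal language interface
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)             # predicts output tokens

    def forward(self, image, prompt_tokens):
        # The task is specified only by the prompt tokens (e.g. "detect", "caption"),
        # so no modality-specific encoders or task-specific heads are needed.
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)
        prompt = self.token_embed(prompt_tokens)
        seq = torch.cat([patches, prompt], dim=1)
        return self.head(self.encoder(seq))                # logits over the shared vocabulary


# Every task becomes "image + prompt in, token sequence out"; predicted tokens are
# decoded per task into boxes, masks, captions, or grounded regions.
model = GeneralistViT()
image = torch.randn(1, 3, 224, 224)
prompt = torch.randint(0, 32000, (1, 8))
print(model(image, prompt).shape)   # (1, 204, 32000): 196 patch tokens + 8 prompt tokens
```

Because every task shares the same weights and the same output space, improvements learned on one task can transfer to the others, which is the source of the multi-task synergy described above.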

Quick Start & Requirements

  • Installation: Requires Python 3.8, PyTorch 1.9.1+cu111, and specific versions of openmim, mmengine, mmcv, and transformers. Installation involves cloning the repository, setting up a conda environment, and installing dependencies via pip and mim.
  • Prerequisites: Pretrained text embeddings from Hugging Face are required (a download sketch follows this list). Java is optionally needed for image-caption evaluation, and the LVIS API is required for the LVIS dataset.
  • Resources: Training requires multiple GPUs. Pretrained weights are available for different model sizes (Base, Large, Huge).
  • Links: arXiv, Hugging Face Embeddings
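
For the text-embedding prerequisite, a minimal download sketch using huggingface_hub is shown below. The repository ID is a placeholder, since the exact Hugging Face repository is only linked above, not named here.

```python
# Hypothetical sketch: fetch the pretrained text embeddings with huggingface_hub.
# Replace the placeholder repo_id with the embeddings repository linked by the project.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<project-text-embeddings>",  # placeholder, see the Hugging Face link above
    local_dir="./text_embeddings",        # local directory to store the files
)
print("Embeddings downloaded to", local_dir)
```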

Highlighted Details

  • Achieves state-of-the-art performance on multi-tasking, zero-shot, and few-shot benchmarks.
  • Demonstrates significant performance gains through multi-task training synergy.
  • Scales well with model size and data, showing strong generalizability.
  • Supports a wide range of vision tasks including detection, segmentation, and captioning.

Maintenance & Community

The project's paper was accepted as an oral presentation at ECCV 2024. The codebase is built on MMDetection and uses BLIP for its text embeddings. Planned work includes engineering optimization for speed, joint training with language, and code refactoring.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The authors note that parts of the current code are "dirty" and slated for refactoring. Engineering optimizations for speed are planned but not yet implemented, and joint training with language remains a future goal.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

  • Open-source framework for training large multimodal models
  • 4k stars, top 0.1% on sourcepulse
  • Created 2 years ago, updated 11 months ago