GiT: Vision Transformer research paper and code for generalist vision models via a universal language interface
GiT is the official implementation of a generalist vision transformer that unifies various vision tasks using a plain transformer architecture, similar to LLMs. It targets researchers and practitioners who want to reduce task-specific engineering in computer vision by using a single, adaptable model for object detection, instance segmentation, semantic segmentation, image captioning, and visual grounding, and it demonstrates strong multi-task and zero/few-shot performance without negative transfer.
How It Works
GiT employs a minimalist architecture consisting solely of a vanilla Vision Transformer (ViT) and a universal language interface. This approach avoids modality-specific encoders and task-specific heads, mirroring the success of large language models. By treating diverse vision tasks as sequence-to-sequence problems solvable by a plain transformer, GiT benefits from task synergy during multi-task training, leading to performance improvements across tasks.
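The sketch below is a minimal, self-contained PyTorch illustration of this sequence-to-sequence framing, not the actual GiT implementation: the module names, dimensions, and the single-pass (non-causal) attention are simplifying assumptions made only for this example.

```python
import torch
import torch.nn as nn


class PlainViTSeq2Seq(nn.Module):
    """Illustrative sketch: one plain transformer consumes image patch tokens
    plus a task prompt and predicts text-like output tokens, so detection,
    segmentation, captioning, and grounding can all be serialized the same
    way. Hyperparameters and names are hypothetical, not taken from GiT."""

    def __init__(self, vocab_size=32000, dim=256, depth=6, heads=8,
                 patch=16, img_size=224):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.tok_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, images, prompt_tokens):
        # images: (B, 3, H, W); prompt_tokens: (B, T) task prompt / partial output
        x = self.patch_embed(images).flatten(2).transpose(1, 2) + self.pos_embed
        t = self.tok_embed(prompt_tokens)
        seq = torch.cat([x, t], dim=1)        # one shared token sequence
        seq = self.encoder(seq)               # plain transformer, no task heads
        return self.head(seq[:, x.size(1):])  # logits over the text positions only


model = PlainViTSeq2Seq()
imgs = torch.randn(2, 3, 224, 224)
prompt = torch.randint(0, 32000, (2, 10))
logits = model(imgs, prompt)                  # shape (2, 10, 32000)
```

In the real model the output tokens are decoded autoregressively with appropriate attention masking; the sketch omits that detail to keep the shared-sequence idea in focus.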
Quick Start & Requirements
The project depends on openmim, mmengine, mmcv, and transformers. Installation involves cloning the repository, setting up a conda environment, and installing dependencies via pip and mim; a minimal dependency check is sketched below.
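The snippet below (hypothetical, not part of the repository) simply verifies that the dependencies named above are importable; note that the openmim package is imported as `mim`, and the exact version requirements are defined by the repository itself.

```python
# Quick environment check for the dependencies listed above.
import importlib

for pkg in ("mim", "mmengine", "mmcv", "transformers"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'unknown version')}")
    except ImportError:
        print(f"{pkg}: not installed -- install via pip/mim as described above")
```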
Highlighted Details
Maintenance & Community
The project accompanies an ECCV 2024 oral presentation. The codebase is built upon MMDetection and uses BLIP for text embeddings. Planned work includes engineering optimization, joint training with language, and code refactoring.
Licensing & Compatibility
Limitations & Caveats
The current implementation is noted as having some "dirty" code requiring refactoring. Engineering optimizations for speed are planned but not yet implemented. Joint training with language is also a future development goal.