iBOT: Image BERT Pre-Training with Online Tokenizer
iBOT is a PyTorch framework for self-supervised pre-training of Vision Transformers (ViTs) and Swin Transformers using masked image modeling and self-distillation. It enables models to learn both global and local semantic features, demonstrating strong performance on downstream tasks like object detection and semantic segmentation, and can extract semantically meaningful image parts.
How It Works
iBOT employs a masked image modeling approach combined with self-distillation. It masks portions of input images and trains the model to predict the masked content. The self-distillation component uses a teacher-student setup, where the teacher model's predictions guide the student's learning, promoting the extraction of robust semantic representations. This dual approach allows iBOT to capture fine-grained local details and broader contextual understanding.
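The teacher-student masking loop described above can be sketched in a few lines. This is a minimal, framework-agnostic illustration, not iBOT's actual implementation: the tiny `TokenEncoder`, the dimensions, the fixed mask, and the momentum value are all hypothetical stand-ins for the real ViT backbone, block-wise random masking, and training schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TokenEncoder:
    """Hypothetical stand-in for a ViT backbone: one linear map per patch token."""
    def __init__(self, dim_in, dim_out):
        self.W = rng.standard_normal((dim_in, dim_out)) * 0.02
    def __call__(self, tokens):
        return tokens @ self.W

D_IN, D_OUT, N_PATCHES = 16, 8, 10
student = TokenEncoder(D_IN, D_OUT)
teacher = TokenEncoder(D_IN, D_OUT)

patches = rng.standard_normal((N_PATCHES, D_IN))
mask = np.zeros(N_PATCHES, dtype=bool)
mask[:4] = True                      # illustrative fixed mask; iBOT masks random blocks
mask_token = np.zeros(D_IN)          # learnable [MASK] embedding in the real model

# Student sees the corrupted image; teacher sees the full image.
student_in = np.where(mask[:, None], mask_token, patches)
student_logits = student(student_in)
teacher_probs = softmax(teacher(patches))

# Self-distillation loss on masked positions only: cross-entropy between the
# teacher's token distributions and the student's predictions.
loss = -(teacher_probs[mask] * np.log(softmax(student_logits[mask]))).sum(axis=-1).mean()

# The teacher is an exponential moving average of the student, not backprop-trained.
m = 0.996
teacher.W = m * teacher.W + (1 - m) * student.W
```

In the real framework the student is updated by backpropagating this loss (plus a DINO-style loss on the class token for global semantics), while the teacher only ever receives EMA updates as in the last line.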
Quick Start & Requirements
The repository's run.sh script drives both pre-training and fine-tuning.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Pre-training requires significant computational resources, and reproducing the paper's results depends on the specific configurations detailed in the README; setup is likely to be complex for users without prior experience in distributed training or large-scale model pre-training.