iBOT by ByteDance

Research-paper implementation: image BERT pre-training via self-distillation

created 3 years ago
739 stars

Top 47.9% on sourcepulse

Project Summary

iBOT is a PyTorch framework for self-supervised pre-training of Vision Transformers (ViTs) and Swin Transformers using masked image modeling and self-distillation. Models trained with iBOT learn both global and local semantic features, perform strongly on downstream tasks such as object detection and semantic segmentation, and can extract semantically meaningful image parts.

How It Works

iBOT combines masked image modeling with self-distillation. Portions of the input image are masked, and the model is trained to predict the content of the masked patches. The self-distillation component uses a teacher-student setup: the teacher's predictions for the masked patches serve as the student's training targets, promoting robust semantic representations. This dual objective lets iBOT capture both fine-grained local detail and broader contextual understanding.
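The objective described above can be sketched in plain Python. This is a minimal illustration, not iBOT's actual code: the temperature values, momentum constant, and all function names are assumptions made for the sketch.

```python
import math

def softmax(logits, temp):
    """Temperature-scaled softmax over a list of logits."""
    m = max(z / temp for z in logits)
    exps = [math.exp(z / temp - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_distillation_loss(student_logits, teacher_logits, mask,
                             student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student output distributions,
    averaged over masked patch positions only: the teacher's sharper
    (lower-temperature) distribution is the student's target."""
    losses = []
    for s_log, t_log, is_masked in zip(student_logits, teacher_logits, mask):
        if not is_masked:
            continue  # only masked patches contribute to the loss
        t = softmax(t_log, teacher_temp)
        s = softmax(s_log, student_temp)
        losses.append(-sum(ti * math.log(si + 1e-12) for ti, si in zip(t, s)))
    return sum(losses) / max(len(losses), 1)

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights track an exponential moving average of the
    student's weights; the teacher receives no gradient updates."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In this setup the student sees the masked view while the teacher sees the unmasked image; only masked positions contribute to the loss, and the teacher is updated by EMA rather than backpropagation.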

Quick Start & Requirements

  • Install/Run: Use the provided run.sh script for pre-training and fine-tuning.
  • Prerequisites: PyTorch, Python. Specific GPU requirements (e.g., multiple GPUs for distributed training) and datasets (e.g., ImageNet) are implied by the examples.
  • Resources: Training is resource-intensive; the examples use multiple nodes and GPUs (e.g., 2 nodes, 8 GPUs).
  • Links: arXiv, Colab

Highlighted Details

  • Achieves 81.0% linear probing accuracy with ViT-L/16 on ImageNet-1K.
  • Supports both ViT and Swin Transformer architectures.
  • Provides pre-trained models for various configurations (ViT-S/16, ViT-B/16, ViT-L/16, Swin-T) with performance benchmarks.
  • Includes utilities for analyzing learned properties, such as extracting semantic patterns and correspondences.
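Linear probing, the evaluation behind the 81.0% figure above, freezes the pre-trained backbone and trains only a linear classifier on its features. Below is a dependency-free sketch of one SGD step, assuming the frozen backbone's features are already computed; all names and the learning rate are illustrative, not iBOT's evaluation code.

```python
import math

def linear_probe_step(features, label, weights, biases, lr=0.1):
    """One SGD step for a linear classifier on frozen backbone features.
    `weights` is a list of rows (one per class); updated in place."""
    # Forward pass: per-class logits, then softmax probabilities.
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Backward pass: cross-entropy gradient is (prob - one_hot) * feature.
    for c, row in enumerate(weights):
        grad = probs[c] - (1.0 if c == label else 0.0)
        for j, f in enumerate(features):
            row[j] -= lr * grad * f
        biases[c] -= lr * grad
    return probs
```

Only `weights` and `biases` are trained; the backbone that produced `features` stays frozen, so probe accuracy directly measures the quality of the pre-trained representations.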

Maintenance & Community

  • The project is from ByteDance.
  • The code is based on DINO and BEiT repositories.
  • No explicit community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The README implies that pre-training requires significant computational resources, and reproducing the paper's results depends on the specific configurations it details. Users without prior experience in distributed training or large-scale model pre-training may find setup complex.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 21 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

  • Top 0.1% · 4k stars
  • Open-source framework for training large multimodal models
  • created 2 years ago, updated 11 months ago
  • Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Phil Wang (prolific research paper implementer), and 4 more.

vit-pytorch by lucidrains

  • Top 0.2% · 24k stars
  • PyTorch library for Vision Transformer variants and related techniques
  • created 4 years ago, updated 6 days ago