iBOT by ByteDance

Research-paper implementation: image BERT pre-training via self-distillation

created 3 years ago
739 stars

Top 47.9% on sourcepulse

Project Summary

iBOT is a PyTorch framework for self-supervised pre-training of Vision Transformers (ViTs) and Swin Transformers using masked image modeling and self-distillation. Models trained with iBOT learn both global and local semantic features, perform strongly on downstream tasks such as object detection and semantic segmentation, and can extract semantically meaningful image parts.

How It Works

iBOT combines masked image modeling with self-distillation. Portions of the input image are masked, and the model is trained to predict the content of the masked patches. The self-distillation component uses a teacher-student setup: the teacher's predictions for the masked patches serve as the student's training targets, promoting robust semantic representations. This dual objective lets iBOT capture both fine-grained local detail and broader contextual understanding.
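The objective described above can be sketched in plain Python. This is a minimal illustration, not iBOT's actual code: the temperature values, momentum constant, and all function names are assumptions made for the sketch.

```python
import math

def softmax(logits, temp):
    """Temperature-scaled softmax over a list of logits."""
    m = max(z / temp for z in logits)
    exps = [math.exp(z / temp - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_distillation_loss(student_logits, teacher_logits, mask,
                             student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student output distributions,
    averaged over masked patch positions only: the teacher's sharper
    (lower-temperature) distribution is the student's target."""
    losses = []
    for s_log, t_log, is_masked in zip(student_logits, teacher_logits, mask):
        if not is_masked:
            continue  # only masked patches contribute to the loss
        t = softmax(t_log, teacher_temp)
        s = softmax(s_log, student_temp)
        losses.append(-sum(ti * math.log(si + 1e-12) for ti, si in zip(t, s)))
    return sum(losses) / max(len(losses), 1)

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights track an exponential moving average of the
    student's weights; the teacher receives no gradient updates."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In this setup the student sees the masked view while the teacher sees the unmasked image; only masked positions contribute to the loss, and the teacher is updated by EMA rather than backpropagation.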

Quick Start & Requirements

  • Install/Run: Use the provided run.sh script for pre-training and fine-tuning.
  • Prerequisites: PyTorch, Python. Specific GPU requirements (e.g., multiple GPUs for distributed training) and datasets (e.g., ImageNet) are implied by the examples.
  • Resources: Training is resource-intensive; the examples use multiple nodes and GPUs (e.g., 2 nodes, 8 GPUs).
  • Links: arXiv, Colab

Highlighted Details

  • Achieves 81.0% linear probing accuracy with ViT-L/16 on ImageNet-1K.
  • Supports both ViT and Swin Transformer architectures.
  • Provides pre-trained models for various configurations (ViT-S/16, ViT-B/16, ViT-L/16, Swin-T) with performance benchmarks.
  • Includes utilities for analyzing learned properties, such as extracting semantic patterns and correspondences.
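Linear probing, the evaluation behind the 81.0% figure above, freezes the pre-trained backbone and trains only a linear classifier on its features. Below is a dependency-free sketch of one SGD step, assuming the frozen backbone's features are already computed; all names and the learning rate are illustrative, not iBOT's evaluation code.

```python
import math

def linear_probe_step(features, label, weights, biases, lr=0.1):
    """One SGD step for a linear classifier on frozen backbone features.
    `weights` is a list of rows (one per class); updated in place."""
    # Forward pass: per-class logits, then softmax probabilities.
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Backward pass: cross-entropy gradient is (prob - one_hot) * feature.
    for c, row in enumerate(weights):
        grad = probs[c] - (1.0 if c == label else 0.0)
        for j, f in enumerate(features):
            row[j] -= lr * grad * f
        biases[c] -= lr * grad
    return probs
```

Only `weights` and `biases` are trained; the backbone that produced `features` stays frozen, so probe accuracy directly measures the quality of the pre-trained representations.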

Maintenance & Community

  • The project is from ByteDance.
  • The code is based on DINO and BEiT repositories.
  • No explicit community links (Discord/Slack) are provided in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The README implies that pre-training requires significant computational resources, and reproducing the paper's results depends on the specific configurations it details. Users without prior experience in distributed training or large-scale model pre-training may find setup complex.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 21 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

  • Top 0.1% · 4k stars
  • Open-source framework for training large multimodal models
  • created 2 years ago, updated 11 months ago
  • Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Phil Wang (prolific research paper implementer), and 4 more.

vit-pytorch by lucidrains

  • Top 0.2% · 24k stars
  • PyTorch library for Vision Transformer variants and related techniques
  • created 4 years ago, updated 6 days ago