Research paper code for self-supervised language-image pre-training
Top 46.1% on sourcepulse
SLIP provides code and pre-trained models for language-image pre-training, aiming to improve upon CLIP baselines. It is designed for researchers and practitioners in computer vision and natural language processing who want to leverage large-scale multimodal datasets for self-supervised learning. The project reports notable gains over CLIP in zero-shot transfer and linear classification.
How It Works
SLIP combines self-supervised learning with language-image pre-training, building on the CLIP architecture. The image encoder is trained with two objectives at once: a CLIP-style image-text contrastive loss and a SimCLR-style self-supervised contrastive loss over augmented views of the same image, which is what drives the improvement over CLIP alone. The approach uses large datasets such as YFCC15M and Conceptual Captions to train Vision Transformer (ViT) models of various sizes (Small, Base, Large).
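The sketch below illustrates that combined objective in PyTorch: an InfoNCE image-text loss plus a SimCLR-style loss over two augmented views. Function names, temperatures, and the loss weighting are illustrative assumptions, not the repository's exact code.

```python
# Minimal sketch of the SLIP objective, assuming an InfoNCE image-text loss
# plus a SimCLR-style loss on two augmented views. All names and defaults
# here are illustrative, not the repository's API.
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """InfoNCE over matched image-text pairs, averaged over both directions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def simclr_loss(z1, z2, temperature=0.1):
    """NT-Xent loss between two augmented views of the same images."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=-1)
    sim = z @ z.t() / temperature
    # Exclude self-similarity; the positive for view i is the other view of the same image.
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def slip_loss(image_emb, text_emb, z1, z2, ssl_weight=1.0):
    # Total objective: image-text contrastive loss + weighted self-supervised term.
    return clip_loss(image_emb, text_emb) + ssl_weight * simclr_loss(z1, z2)
```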
Quick Start & Requirements
PyTorch and timm are required. Tested with CUDA 11.3, CuDNN 8.2.0, PyTorch 1.10.0, and timm 0.5.0. Training is launched with submitit (for SLURM clusters) or torchrun.
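As a rough orientation, released checkpoints can be inspected with plain PyTorch before wiring up the full evaluation pipeline. The filename and state-dict layout below are assumptions; the repository's README and model code define the actual factory functions and checkpoint keys.

```python
# Minimal sketch: inspecting a released SLIP checkpoint with plain PyTorch.
# The filename and the "state_dict" key are assumptions about the checkpoint
# layout, not guaranteed by the repository.
import torch

ckpt = torch.load("slip_base_100ep.pt", map_location="cpu")  # hypothetical filename
state_dict = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
# DistributedDataParallel checkpoints often prefix keys with "module."; strip it.
state_dict = {k[len("module."):] if k.startswith("module.") else k: v
              for k, v in state_dict.items()}
print(f"{len(state_dict)} parameter tensors; sample keys: {list(state_dict)[:3]}")
```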
Highlighted Details
Maintenance & Community
The project is released by Facebook AI Research (FAIR). No specific community channels such as Discord or Slack are mentioned in the README. The repository was last updated about two years ago and appears inactive.
Licensing & Compatibility
MIT License. This license is permissive and allows for commercial use and integration into closed-source projects.
Limitations & Caveats
The setup for datasets, particularly YFCC15M, involves significant data preparation steps. The code has specific version requirements for PyTorch and timm, and the finetuning section notes a dependency on a specific commit of the BEiT repository.
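For YFCC15M, the preparation amounts to reducing the YFCC100M metadata to the 15M-image subset and pairing each image with caption text. The sketch below shows the general shape of that filtering step; the column names, subset-id list, and output format are assumptions, and the repository's dataset instructions define the exact files it expects.

```python
# Illustrative sketch of one YFCC15M preparation step: keeping only the
# metadata rows in the 15M-image subset and building an (id, caption) index.
# Column names and file formats are assumptions, not the repository's layout.
import csv

def build_yfcc15m_index(metadata_csv, subset_ids, out_csv):
    """Write an (photo_id, caption) CSV for rows whose id is in subset_ids."""
    with open(metadata_csv, newline="") as src, open(out_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["photo_id", "caption"])
        writer.writeheader()
        for row in reader:
            if row["photo_id"] in subset_ids:
                # Use title and description together as the caption text.
                caption = " ".join(filter(None, [row.get("title"), row.get("description")]))
                writer.writerow({"photo_id": row["photo_id"], "caption": caption})
```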