Research paper code for self-supervised language-image pre-training
Top 46.1% on sourcepulse
SLIP provides code and pre-trained models for language-image pre-training, aiming to improve upon CLIP baselines. It is designed for researchers and practitioners in computer vision and natural language processing who want to leverage large-scale multimodal datasets for self-supervised learning. The project reports notable gains over CLIP in zero-shot transfer and linear classification.
How It Works
SLIP combines self-supervised learning with language-image pre-training, building on the CLIP architecture. The image encoder is trained with two objectives at once: a CLIP-style image-text contrastive loss and a SimCLR-style self-supervised contrastive loss over augmented views of the same image, which is what drives the improvement over CLIP alone. The approach uses large datasets such as YFCC15M and Conceptual Captions to train Vision Transformer (ViT) models of various sizes (Small, Base, Large).
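The sketch below illustrates that combined objective in PyTorch: an InfoNCE image-text loss plus a SimCLR-style loss over two augmented views. Function names, temperatures, and the loss weighting are illustrative assumptions, not the repository's exact code.

```python
# Minimal sketch of the SLIP objective, assuming an InfoNCE image-text loss
# plus a SimCLR-style loss on two augmented views. All names and defaults
# here are illustrative, not the repository's API.
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """InfoNCE over matched image-text pairs, averaged over both directions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def simclr_loss(z1, z2, temperature=0.1):
    """NT-Xent loss between two augmented views of the same images."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=-1)
    sim = z @ z.t() / temperature
    # Exclude self-similarity; the positive for view i is the other view of the same image.
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def slip_loss(image_emb, text_emb, z1, z2, ssl_weight=1.0):
    # Total objective: image-text contrastive loss + weighted self-supervised term.
    return clip_loss(image_emb, text_emb) + ssl_weight * simclr_loss(z1, z2)
```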
Quick Start & Requirements
PyTorch and timm are required. Tested with CUDA 11.3, CuDNN 8.2.0, PyTorch 1.10.0, and timm 0.5.0. Training is launched with submitit (for SLURM clusters) or torchrun.
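As a rough orientation, released checkpoints can be inspected with plain PyTorch before wiring up the full evaluation pipeline. The filename and state-dict layout below are assumptions; the repository's README and model code define the actual factory functions and checkpoint keys.

```python
# Minimal sketch: inspecting a released SLIP checkpoint with plain PyTorch.
# The filename and the "state_dict" key are assumptions about the checkpoint
# layout, not guaranteed by the repository.
import torch

ckpt = torch.load("slip_base_100ep.pt", map_location="cpu")  # hypothetical filename
state_dict = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
# DistributedDataParallel checkpoints often prefix keys with "module."; strip it.
state_dict = {k[len("module."):] if k.startswith("module.") else k: v
              for k, v in state_dict.items()}
print(f"{len(state_dict)} parameter tensors; sample keys: {list(state_dict)[:3]}")
```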
Highlighted Details
Maintenance & Community
The project is released by Facebook AI Research (FAIR). No specific community channels such as Discord or Slack are mentioned in the README. The repository was last updated about two years ago and appears inactive.
Licensing & Compatibility
MIT License. This license is permissive and allows for commercial use and integration into closed-source projects.
Limitations & Caveats
The setup for datasets, particularly YFCC15M, involves significant data preparation steps. The code has specific version requirements for PyTorch and timm, and the finetuning section notes a dependency on a specific commit of the BEiT repository.
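For YFCC15M, the preparation amounts to reducing the YFCC100M metadata to the 15M-image subset and pairing each image with caption text. The sketch below shows the general shape of that filtering step; the column names, subset-id list, and output format are assumptions, and the repository's dataset instructions define the exact files it expects.

```python
# Illustrative sketch of one YFCC15M preparation step: keeping only the
# metadata rows in the 15M-image subset and building an (id, caption) index.
# Column names and file formats are assumptions, not the repository's layout.
import csv

def build_yfcc15m_index(metadata_csv, subset_ids, out_csv):
    """Write an (photo_id, caption) CSV for rows whose id is in subset_ids."""
    with open(metadata_csv, newline="") as src, open(out_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["photo_id", "caption"])
        writer.writeheader()
        for row in reader:
            if row["photo_id"] in subset_ids:
                # Use title and description together as the caption text.
                caption = " ".join(filter(None, [row.get("title"), row.get("description")]))
                writer.writerow({"photo_id": row["photo_id"], "caption": caption})
```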