SLIP by facebookresearch

Research paper code for self-supervised language-image pre-training

created 3 years ago
773 stars

Top 46.1% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

SLIP provides code and pre-trained models for self-supervised language-image pre-training, improving on CLIP baselines. It is aimed at researchers and practitioners in computer vision and natural language processing who want to leverage large-scale multimodal datasets for self-supervised learning, and it reports significant gains over CLIP in zero-shot transfer and linear classification.

How It Works

SLIP combines self-supervised learning with language-image pre-training, building on the CLIP architecture. It trains with a multi-task objective: CLIP's image-text contrastive loss plus a SimCLR-style contrastive loss over two augmented views of each image (sketched below). Training uses large datasets such as YFCC15M and Conceptual Captions to pre-train Vision Transformer (ViT) models at several sizes (Small, Base, Large).
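
The combined objective can be pictured as follows. This is a minimal sketch, not the repo's code: the embedding inputs, temperatures, and the `ssl_scale` weight are placeholders (the paper weights the self-supervised term by a fixed scale).

```python
# Minimal sketch of SLIP's multi-task objective (not the repo's exact code).
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """InfoNCE over matched image-text pairs, as in CLIP."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image->text and text->image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def simclr_loss(z1, z2, temperature=0.1):
    """NT-Xent between projections of two augmented views of the same images."""
    z = F.normalize(torch.cat([z1, z2]), dim=-1)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float('-inf'))  # mask self-similarity
    n = z1.size(0)
    # Positive for view i is the other view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n),
                         torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def slip_loss(image_emb, text_emb, z1, z2, ssl_scale=1.0):
    """SLIP trains on the sum of the CLIP and SimCLR terms."""
    return clip_loss(image_emb, text_emb) + ssl_scale * simclr_loss(z1, z2)
```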

Quick Start & Requirements

  • Install: PyTorch and timm are required. Tested with CUDA 11.3, CuDNN 8.2.0, PyTorch 1.10.0, and timm 0.5.0 (a quick version check is sketched after this list).
  • Data: Requires downloading and structuring large datasets like YFCC100M, COCO, CC3M/CC12M, or RedCaps. Specific metadata preparation is needed for each.
  • Training: Supports distributed training via submitit or torchrun.
  • Evaluation: Scripts for zero-shot transfer and linear classification are provided.
  • Links: Official Paper
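
Since the repo pins its tested versions, it is worth confirming the local environment matches before training. This snippet is a convenience sketch, not part of the repo:

```python
# Print the local versions against the repo's tested setup.
import timm
import torch

print('PyTorch:', torch.__version__)             # tested: 1.10.0
print('timm:', timm.__version__)                 # tested: 0.5.0
print('CUDA:', torch.version.cuda)               # tested: 11.3
print('cuDNN:', torch.backends.cudnn.version())  # tested: 8200 (i.e. 8.2.0)
```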

Highlighted Details

  • Outperforms CLIP trained on the same data on ImageNet zero-shot classification (see the zero-shot sketch after this list).
  • Provides pre-trained weights for ViT-Small, ViT-Base, and ViT-Large models.
  • Includes evaluation scripts for 26 downstream datasets.
  • Supports training on multiple large-scale datasets including YFCC15M, CC3M, CC12M, COCO, and RedCaps.
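
Zero-shot evaluation follows CLIP: class names are rendered into text prompts, both modalities are embedded, and the class whose prompt embedding has the highest cosine similarity to the image embedding wins. A minimal sketch, assuming hypothetical `encode_image`/`encode_text`/`tokenize` interfaces and a single prompt template (the actual evaluation ensembles many templates):

```python
# Hypothetical zero-shot classification sketch; interface names are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenize, images, class_names):
    # One text embedding per class, built from a prompt template.
    prompts = tokenize([f'a photo of a {name}.' for name in class_names])
    text_emb = F.normalize(model.encode_text(prompts), dim=-1)
    image_emb = F.normalize(model.encode_image(images), dim=-1)
    # Cosine similarity; the highest-scoring class prompt is the prediction.
    return (image_emb @ text_emb.t()).argmax(dim=-1)
```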

Maintenance & Community

The project is released by Facebook AI Research (FAIR). No specific community channels like Discord or Slack are mentioned in the README.

Licensing & Compatibility

MIT License. This license is permissive and allows for commercial use and integration into closed-source projects.

Limitations & Caveats

Dataset setup, particularly for YFCC15M, involves significant preparation steps. The code was tested against specific versions of PyTorch and timm, and the finetuning instructions depend on a specific commit of the BeiT repository.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Top 0.1% on sourcepulse
4k stars
Open-source framework for training large multimodal models
created 2 years ago
updated 11 months ago