open_clip by mlfoundations

OpenCLIP: open-source CLIP implementation for vision-language representation learning

Created 4 years ago
12,876 stars

Top 3.9% on SourcePulse

Project Summary

This repository provides an open-source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training) model, enabling researchers and developers to train and utilize powerful vision-language models. It offers a comprehensive suite of tools for training, fine-tuning, and evaluating CLIP-style models on large datasets, with pre-trained models achieving state-of-the-art zero-shot accuracy on benchmarks like ImageNet.

How It Works

OpenCLIP implements the contrastive language-image pre-training objective, learning to align image and text embeddings in a shared space. It supports a variety of vision backbones (e.g., ViT, ConvNeXt) and text encoders, as well as SigLIP-style models, allowing for flexible architectures. The codebase is optimized for large-scale distributed training, featuring efficient data loading (WebDataset), gradient accumulation, and mixed-precision training.
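
The training objective is a symmetric contrastive (InfoNCE) loss over matched image-text pairs. Below is a minimal sketch of that objective, not the library's exact ClipLoss (which additionally gathers features across distributed workers):

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_features, text_features, logit_scale):
        # L2-normalize so dot products are cosine similarities.
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)
        # Pairwise similarity logits, scaled by a learned temperature.
        logits_per_image = logit_scale * image_features @ text_features.T
        logits_per_text = logits_per_image.T
        # Matched pairs sit on the diagonal: image i corresponds to text i.
        labels = torch.arange(image_features.shape[0], device=image_features.device)
        return (F.cross_entropy(logits_per_image, labels)
                + F.cross_entropy(logits_per_text, labels)) / 2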

Quick Start & Requirements

  • Install: pip install open_clip_torch
  • Requirements: PyTorch; timm (latest release recommended); transformers (only for models using Hugging Face tokenizers). A CUDA-capable GPU is strongly recommended for training and inference.
  • Usage and pretrained-model loading details are in the README; a minimal zero-shot example is sketched below.
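
A minimal zero-shot classification sketch, following the pattern in the project README; the image path and prompt strings are placeholders, and 'laion2b_s34b_b79k' is one of several published pretrained tags:

    import torch
    from PIL import Image
    import open_clip

    # Load pretrained weights plus the matching preprocessing and tokenizer.
    model, _, preprocess = open_clip.create_model_and_transforms(
        'ViT-B-32', pretrained='laion2b_s34b_b79k')
    model.eval()
    tokenizer = open_clip.get_tokenizer('ViT-B-32')

    image = preprocess(Image.open('cat.jpg')).unsqueeze(0)  # placeholder image
    text = tokenizer(['a photo of a cat', 'a photo of a dog'])

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    print('Label probs:', text_probs)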

Highlighted Details

  • Offers a wide range of pre-trained models with varying architectures and training datasets (LAION-2B, DataComp-1B), achieving high zero-shot ImageNet accuracy.
  • Supports training of CoCa (Contrastive Captioner) models for generative tasks such as captioning (see the sketch after this list).
  • Includes robust distributed training capabilities, tested up to 1024 A100 GPUs, with native SLURM support.
  • Features advanced training techniques like patch dropout for faster training and int8 support for inference/training speedups.
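
As an illustration of the CoCa generative path referenced above, a sketch adapted from the README's captioning example; the pretrained tag and image path are illustrative, not the only options:

    import torch
    from PIL import Image
    import open_clip

    # Load a CoCa model; this tag is one published fine-tuned variant.
    model, _, transform = open_clip.create_model_and_transforms(
        'coca_ViT-L-14', pretrained='mscoco_finetuned_laion2B-s13B-b90k')
    model.eval()

    im = transform(Image.open('cat.jpg').convert('RGB')).unsqueeze(0)  # placeholder

    with torch.no_grad():
        generated = model.generate(im)

    # Decode token ids and strip the special tokens around the caption.
    caption = open_clip.decode(generated[0])
    print(caption.split('<end_of_text>')[0].replace('<start_of_text>', ''))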

Maintenance & Community

  • Led by prominent researchers in the field, including Ross Wightman and Romain Beaumont.
  • Acknowledges contributions from various institutions and individuals.
  • Encourages community contributions via issues and pull requests.

Licensing & Compatibility

  • The repository is primarily licensed under the MIT License, allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Some older checkpoints use QuickGELU, which is less efficient than native torch.nn.GELU; newer models default to nn.GELU (see the loading sketch below).
  • Beta support for int8 training is available, with potential for further optimization.
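
For QuickGELU-era checkpoints (e.g., the original OpenAI weights), model configs carry a '-quickgelu' suffix so the activation matches the weights; a minimal loading sketch:

    import open_clip

    # OpenAI-era weights were trained with QuickGELU, so use the
    # '-quickgelu' config to keep activations consistent with the checkpoint.
    model, _, preprocess = open_clip.create_model_and_transforms(
        'ViT-B-32-quickgelu', pretrained='openai')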

Health Check

  • Last commit: 16 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 4

Star History

  • 194 stars in the last 30 days

Explore Similar Projects

Starred by Jiaming Song (Chief Scientist at Luma AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

Otter by EvolvingLMMs-Lab
0.0% · 3k stars
Multimodal model for improved instruction following and in-context learning
Created 2 years ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations
0.1% · 4k stars
Open-source framework for training large multimodal models
Created 3 years ago · Updated 1 year ago