OpenCLIP: open-source CLIP implementation for vision-language representation learning
This repository provides an open-source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training) model, enabling researchers and developers to train and utilize powerful vision-language models. It offers a comprehensive suite of tools for training, fine-tuning, and evaluating CLIP-style models on large datasets, with pre-trained models achieving state-of-the-art zero-shot accuracy on benchmarks like ImageNet.
How It Works
OpenCLIP implements the contrastive language-image pre-training objective, learning to align image and text embeddings in a shared space. It supports a range of vision backbones (e.g., ViT, ConvNeXt) and text encoders, including SigLIP-style models, allowing for flexible model architectures. The codebase is optimized for large-scale distributed training, featuring efficient data loading (WebDataset), gradient accumulation, and mixed-precision training.
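As a rough sketch of that objective (illustrative only, not the repository's exact loss code), the symmetric cross-entropy over a batch of paired image/text embeddings looks like this:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Normalize embeddings so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits, scaled by a learned temperature.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # Matching image/text pairs sit on the diagonal of the similarity matrix.
    labels = torch.arange(image_features.shape[0], device=image_features.device)

    # Symmetric cross-entropy over image->text and text->image directions.
    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2
```

Each image is contrasted against every caption in the batch (and vice versa), which is why large batch sizes and distributed training matter for CLIP-style models.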
Quick Start & Requirements
Install with pip install open_clip_torch. Dependencies: timm (latest version recommended) and transformers (if using transformer tokenizers). A GPU with CUDA is highly recommended for training and inference.
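A minimal zero-shot classification example, following the upstream usage pattern; the pretrained tag (laion2b_s34b_b79k), image path, and candidate labels are illustrative placeholders:

```python
import torch
from PIL import Image
import open_clip

# Load a pretrained model, its preprocessing transform, and the matching tokenizer.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("image.png")).unsqueeze(0)  # placeholder image path
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Probability of each candidate caption for the image.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(text_probs)
```

The first call downloads the pretrained weights; move the model and input tensors to a CUDA device for reasonable throughput.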
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Older pretrained models (e.g., the original OpenAI CLIP weights) use QuickGELU, which is less efficient than native torch.nn.GELU; newer models default to nn.GELU.
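For context, QuickGELU approximates GELU with a scaled sigmoid; a minimal sketch of the difference (the class name mirrors common usage, not necessarily this repository's internals):

```python
import torch
import torch.nn as nn

class QuickGELU(nn.Module):
    # Sigmoid-based GELU approximation used by the original OpenAI CLIP weights.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(1.702 * x)

x = torch.randn(4)
print(QuickGELU()(x))  # approximate GELU
print(nn.GELU()(x))    # exact (erf-based) GELU, the default in newer models
```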