Image-text matching model for zero-shot prediction
CLIP (Contrastive Language-Image Pre-Training) enables zero-shot image classification by learning from image-text pairs. Given an image and a set of candidate captions, it predicts the most relevant text snippet without task-specific fine-tuning, matching the zero-shot ImageNet accuracy of the original supervised ResNet-50 without using any of its labeled training examples. This makes it useful for researchers and developers who need flexible image understanding without collecting labeled datasets for each new task.
How It Works
CLIP pairs an image encoder (a Vision Transformer or a modified ResNet) with a Transformer text encoder, trained contrastively on a large dataset of image-text pairs. Both encoders map their inputs into a shared multimodal embedding space. At inference time, CLIP scores candidate text descriptions against an image by the cosine similarity of their embeddings, which turns classification into text retrieval and removes the need for a labeled dataset for each new task.
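A minimal sketch of that scoring step using the package's public API (clip.load, clip.tokenize, encode_image, encode_text); the image path and candidate labels below are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate descriptions
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a dog", "a photo of a cat", "a diagram"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)

    # Cosine similarity = dot product of L2-normalized embeddings
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(similarity)  # probability assigned to each candidate description
```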
Quick Start & Requirements
pip install git+https://github.com/openai/CLIP.git
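The upstream README also lists PyTorch 1.7.1 or later with torchvision, plus ftfy, regex, and tqdm, as prerequisites. A quick end-to-end check along the lines of the repository's own example (the image filename is a placeholder):

```python
import torch
import clip
from PIL import Image

# Pretrained checkpoints bundled with the package (downloaded on first use)
print(clip.available_models())  # e.g. ['RN50', ..., 'ViT-B/32', 'ViT-B/16', 'ViT-L/14']

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    # Forward pass returns temperature-scaled image->text and text->image logits
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
```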
Highlighted Details
Maintenance & Community
The project is maintained by OpenAI. Related projects like OpenCLIP offer larger models.
Licensing & Compatibility
MIT License. Permissive for commercial use and integration into closed-source projects.
Limitations & Caveats
While powerful for zero-shot tasks, CLIP's accuracy is sensitive to how the text prompts are phrased; templates such as "a photo of a {label}" typically outperform bare class names, and ensembling several templates per class helps further. For strong results on a specific task, prompt engineering or fine-tuning may still be necessary. The repository targets PyTorch 1.7.1 or later, which is now dated, so compatibility with current PyTorch releases should be verified.
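A sketch of the prompt-ensembling idea described in the CLIP paper, where each class is represented by the averaged embedding of several prompt templates (the class names and templates below are illustrative placeholders):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Placeholder label set and prompt templates; more and better templates
# per class generally improve zero-shot accuracy
classes = ["dog", "cat", "airplane"]
templates = ["a photo of a {}.", "a blurry photo of a {}.", "a sketch of a {}."]

with torch.no_grad():
    class_weights = []
    for name in classes:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        embeddings = model.encode_text(tokens)
        embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
        mean_embedding = embeddings.mean(dim=0)          # average over templates
        class_weights.append(mean_embedding / mean_embedding.norm())
    class_weights = torch.stack(class_weights)           # (num_classes, embed_dim)

# L2-normalized features from model.encode_image can then be scored against
# class_weights with a dot product, as in the earlier zero-shot example.
```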