CLIP by openai

Image-text matching model for zero-shot prediction

created 4 years ago
30,054 stars

Top 1.2% on sourcepulse

View on GitHub
Project Summary

CLIP (Contrastive Language-Image Pre-Training) enables zero-shot image classification by learning from image-text pairs. It allows users to predict the most relevant text snippet for a given image without task-specific fine-tuning, achieving performance comparable to traditional supervised methods on benchmarks like ImageNet. This is particularly useful for researchers and developers needing flexible image understanding capabilities.

How It Works

CLIP pairs an image encoder (a ResNet or Vision Transformer) with a transformer text encoder, trained contrastively on a massive dataset of image-text pairs. Both encoders map their inputs into a shared multimodal embedding space. By computing the cosine similarity between an image embedding and a set of candidate text embeddings, CLIP scores how relevant each text description is to the image, effectively performing zero-shot classification. This bypasses the need for labeled datasets for new tasks.
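
A minimal sketch of this flow using the API documented in the repository (the image path and candidate labels below are placeholders):

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Placeholder inputs: any image file and a list of candidate labels.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    labels = ["a diagram", "a dog", "a cat"]
    text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)

    # After L2-normalization, cosine similarity is just a dot product;
    # the softmax turns the scaled similarities into label probabilities.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    print({label: p.item() for label, p in zip(labels, probs[0])})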

Quick Start & Requirements

  • Install: pip install git+https://github.com/openai/CLIP.git (a quick sanity check follows this list)
  • Prerequisites: PyTorch 1.7.1+ and torchvision. CUDA 11.0+ recommended for GPU acceleration.
  • Setup: Minimal setup time, primarily involves installing dependencies and downloading model weights.
  • Docs: Blog, Paper, Model Card, Colab
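
After installing, a short sanity check can confirm the setup. This is a sketch assuming the package's documented clip.available_models and clip.load functions; model weights are downloaded on first load:

    import torch
    import clip

    # List the pretrained variants shipped with the package (e.g. "RN50", "ViT-B/32").
    print(clip.available_models())

    # Loading a model downloads its weights on first use (by default under ~/.cache/clip).
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    print(f"{sum(p.numel() for p in model.parameters()):,} parameters")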

Highlighted Details

  • Zero-shot, matches the ImageNet accuracy of the original ResNet-50 without using any of its 1.28 million labeled training examples.
  • Supports multiple model variants (e.g., ViT-B/32).
  • Provides methods for encoding images, encoding text, and direct model inference.
  • Includes examples for zero-shot prediction and linear-probe evaluation (a condensed linear-probe sketch follows this list).
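
As referenced above, a condensed sketch of the linear-probe pattern: train a logistic regression on frozen CLIP image features. The CIFAR-100 dataset, cache directory, and regularization strength below are illustrative choices, not fixed by the project:

    import os
    import numpy as np
    import torch
    import clip
    from sklearn.linear_model import LogisticRegression
    from torch.utils.data import DataLoader
    from torchvision.datasets import CIFAR100

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def get_features(dataset):
        # Encode every image with the frozen CLIP image encoder.
        feats, labels = [], []
        with torch.no_grad():
            for images, targets in DataLoader(dataset, batch_size=100):
                feats.append(model.encode_image(images.to(device)).cpu().numpy())
                labels.append(targets.numpy())
        return np.concatenate(feats), np.concatenate(labels)

    root = os.path.expanduser("~/.cache")  # illustrative download location
    train_set = CIFAR100(root, download=True, train=True, transform=preprocess)
    test_set = CIFAR100(root, download=True, train=False, transform=preprocess)
    train_x, train_y = get_features(train_set)
    test_x, test_y = get_features(test_set)

    # A simple L2-regularized logistic regression over the frozen features;
    # C would normally be tuned on a validation split.
    classifier = LogisticRegression(C=0.316, max_iter=1000)
    classifier.fit(train_x, train_y)
    print("test accuracy:", classifier.score(test_x, test_y))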

Maintenance & Community

The project is maintained by OpenAI. Related projects like OpenCLIP offer larger models.

Licensing & Compatibility

MIT License. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

While powerful for zero-shot tasks, CLIP's performance is sensitive to the wording of the text prompts, so prompt engineering (or, for some tasks, a fine-tuned linear probe) may be needed for optimal results. The repository targets PyTorch 1.7.1 as its minimum version, and compatibility with the latest PyTorch releases should be verified.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
Star History
1,413 stars in the last 90 days

Explore Similar Projects

Starred by John Resig (Author of jQuery; Chief Software Architect at Khan Academy), Chenlin Meng (Cofounder of Pika), and 4 more.

clip-retrieval by rom1504

0.3%
3k
CLIP retrieval system for semantic search
created 4 years ago
updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

0.1%
4k
Open-source framework for training large multimodal models
created 2 years ago
updated 11 months ago