CLIP by openai

Image-text matching model for zero-shot prediction

Created 5 years ago
32,228 stars

Top 1.1% on SourcePulse

View on GitHub
Project Summary

CLIP (Contrastive Language-Image Pre-Training) enables zero-shot image classification by learning from image-text pairs. It allows users to predict the most relevant text snippet for a given image without task-specific fine-tuning, achieving performance comparable to traditional supervised methods on benchmarks like ImageNet. This is particularly useful for researchers and developers needing flexible image understanding capabilities.

How It Works

CLIP employs a transformer-based architecture trained on a massive dataset of image-text pairs. It learns to embed images and text into a shared multimodal embedding space. By calculating the cosine similarity between image and text embeddings, CLIP can determine the relevance of text descriptions to an image, effectively performing zero-shot classification. This approach bypasses the need for labeled datasets for new tasks.
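The scoring step described above can be sketched in plain Python. The embedding values here are made up for illustration (a real run would use the model's image and text encoders), but the cosine-similarity-plus-softmax logic is the same; CLIP scales similarities by a learned logit scale (around 100) before the softmax:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores, scale=100.0):
    """Softmax over similarity scores, scaled as CLIP's logit scale does."""
    exps = [math.exp(scale * s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-d embeddings standing in for encode_image / encode_text outputs.
image_embedding = [0.2, 0.9, 0.4]
text_embeddings = {
    "a photo of a dog": [0.25, 0.85, 0.45],  # close to the image embedding
    "a photo of a car": [0.90, 0.10, 0.30],  # far from the image embedding
}

scores = [cosine_similarity(image_embedding, v) for v in text_embeddings.values()]
probs = softmax(scores)
for label, p in zip(text_embeddings, probs):
    print(f"{label}: {p:.3f}")
```

The text description whose embedding is most similar to the image embedding receives nearly all of the probability mass, which is exactly how zero-shot classification works at inference time.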

Quick Start & Requirements

  • Install: pip install git+https://github.com/openai/CLIP.git
  • Prerequisites: PyTorch 1.7.1+ and torchvision. CUDA 11.0+ recommended for GPU acceleration.
  • Setup: Minimal setup time, primarily involves installing dependencies and downloading model weights.
  • Docs: Blog, Paper, Model Card, Colab

Highlighted Details

  • Matches the ImageNet accuracy of the original supervised ResNet-50 in a zero-shot setting, without using any of its 1.28 million labeled training examples.
  • Supports multiple model variants (e.g., ViT-B/32).
  • Provides methods for encoding images, encoding text, and direct model inference.
  • Includes examples for zero-shot prediction and linear-probe evaluation.

Maintenance & Community

The project is maintained by OpenAI. Related projects like OpenCLIP offer larger models.

Licensing & Compatibility

MIT License. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

While powerful for zero-shot tasks, CLIP's performance can vary considerably with the wording of the text prompts, so prompt engineering or fine-tuning may be necessary for optimal results on specific tasks. The codebase targets an older PyTorch baseline (1.7.1), and compatibility with current releases should be verified.
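A common form of prompt engineering is ensembling several prompt templates per class label and averaging the resulting text embeddings. The templates below are illustrative stand-ins for the 80 templates used in CLIP's ImageNet evaluation:

```python
# Hypothetical templates; CLIP's ImageNet zero-shot setup averages
# text embeddings over 80 such variations per class.
TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a photo of the large {}.",
]

def build_prompts(label):
    """Expand one class label into an ensemble of prompts.

    Each prompt would be tokenized and encoded with the text encoder;
    the embeddings are then averaged and re-normalized to form a
    single classifier weight vector for that class.
    """
    return [template.format(label) for template in TEMPLATES]

print(build_prompts("golden retriever"))
```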

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 0
  • Star History: 335 stars in the last 30 days

Explore Similar Projects

Starred by Jianwei Yang (Research Scientist at Meta Superintelligence Lab), Jiaming Song (Chief Scientist at Luma AI), and 5 more.

X-Decoder by microsoft

Generalized decoding model for pixel, image, and language tasks
1k stars · Created 3 years ago · Updated 2 years ago