OpenAI-CLIP by moein-shariatnia

PyTorch CLIP implementation for text-image retrieval

created 4 years ago
696 stars

Top 49.9% on sourcepulse

Project Summary

This repository provides a simplified PyTorch implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training) model. It's designed for researchers and developers interested in understanding and utilizing multimodal learning for tasks like image retrieval based on text queries. The project offers a clear, step-by-step guide to building and training a CLIP model from scratch.

How It Works

The implementation follows the core CLIP methodology: contrastive learning between image and text embeddings. It utilizes a ResNet50 (via timm) as the image encoder and DistilBERT (via HuggingFace transformers) as the text encoder. Both encoders project their outputs into a shared embedding space using separate projection heads. The training objective is a contrastive loss that aims to maximize the similarity between embeddings of corresponding image-text pairs while minimizing similarity for non-matching pairs.
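For concreteness, here is a minimal sketch of that setup in PyTorch. The encoder choices follow the README (ResNet50 via timm, DistilBERT via HuggingFace transformers), but the projection size, temperature, and class names (ProjectionHead, CLIPSketch) are illustrative assumptions rather than the repository's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm
from transformers import DistilBertModel

class ProjectionHead(nn.Module):
    """Maps encoder outputs into the shared embedding space."""
    def __init__(self, in_dim, out_dim=256):  # out_dim is an assumed size
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

class CLIPSketch(nn.Module):
    def __init__(self, embed_dim=256, temperature=0.07):  # assumed values
        super().__init__()
        # Image encoder: ResNet50 with the classification head removed.
        self.image_encoder = timm.create_model("resnet50", pretrained=True, num_classes=0)
        # Text encoder: DistilBERT; the [CLS] token embedding is used below.
        self.text_encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.image_proj = ProjectionHead(self.image_encoder.num_features, embed_dim)
        self.text_proj = ProjectionHead(self.text_encoder.config.dim, embed_dim)
        self.temperature = temperature

    def forward(self, images, input_ids, attention_mask):
        img_emb = self.image_proj(self.image_encoder(images))
        txt_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        txt_emb = self.text_proj(txt_out.last_hidden_state[:, 0, :])
        # Pairwise cosine similarities, scaled by temperature.
        logits = img_emb @ txt_emb.T / self.temperature
        # Matching image-text pairs sit on the diagonal; use symmetric cross-entropy.
        targets = torch.arange(len(images), device=images.device)
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

With a batch of matched image-text pairs, minimizing this loss pulls corresponding embeddings together while pushing apart the off-diagonal, non-matching combinations.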

Quick Start & Requirements

  • Install dependencies: pip install timm transformers
  • Requires PyTorch and a GPU for reasonable training times.
  • The provided main() function orchestrates training on the Flickr8k dataset (paths and hyperparameters are configurable via the CFG class; see the sketch after this list).
  • The README doubles as the documentation and includes a demo walkthrough.
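To give a rough picture of the CFG-driven setup, here is a hedged sketch; the field names, paths, and defaults below are assumptions and may differ from the repository's actual CFG class.

```python
import torch

# Illustrative configuration object of the kind main() reads from.
# Check the repository's CFG class for the real fields and values.
class CFG:
    image_path = "./flickr8k/Images"   # hypothetical path to the Flickr8k images
    captions_path = "./flickr8k"       # hypothetical path to the captions file
    batch_size = 32
    epochs = 4
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```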

Highlighted Details

  • Implements CLIP from scratch in PyTorch.
  • Uses DistilBERT for efficient text encoding.
  • Contrastive loss written from scratch, with step-by-step explanations.
  • Includes functions for training, validation, and inference (text-to-image retrieval); see the retrieval sketch after this list.
  • Model architecture includes projection heads to align image and text embeddings.
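As a rough picture of the retrieval step, the following sketch builds on the CLIPSketch model defined earlier: gallery images are embedded once, a text query is embedded at search time, and images are ranked by cosine similarity. The retrieve function and its signature are illustrative, not the repository's API.

```python
import torch
from transformers import DistilBertTokenizer

@torch.no_grad()
def retrieve(model, image_embeddings, query, tokenizer, top_k=5):
    """Return indices of the top_k gallery images most similar to a text query."""
    model.eval()
    tokens = tokenizer([query], return_tensors="pt", padding=True, truncation=True)
    txt_out = model.text_encoder(**tokens)
    txt_emb = model.text_proj(txt_out.last_hidden_state[:, 0, :])
    # Both embedding sets are L2-normalized, so a dot product is cosine similarity.
    scores = (image_embeddings @ txt_emb.T).squeeze(1)
    return scores.topk(top_k).indices

# Usage sketch:
#   tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
#   image_embeddings = model.image_proj(model.image_encoder(images))  # precomputed gallery
#   retrieve(model, image_embeddings, "a dog running on grass", tokenizer)
```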

Maintenance & Community

The project is maintained by Moein Shariatnia. The README highlights several academic papers that have cited or used this code, indicating community adoption and validation.

Licensing & Compatibility

The repository does not explicitly state a license in the README. This is a critical omission for evaluating commercial use or integration into closed-source projects.

Limitations & Caveats

The lack of an explicit license is a significant limitation. The code is presented as a tutorial and may require adjustments for production environments, particularly regarding dataset handling and error management. Training can be time-consuming without a GPU.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 20 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

  • open_flamingo by mlfoundations — open-source framework for training large multimodal models. 4k stars (top 0.1%); created 2 years ago, updated 11 months ago.