OpenAI-CLIP by moein-shariatnia

PyTorch CLIP implementation for text-image retrieval

Created 4 years ago
703 stars

Top 48.5% on SourcePulse

Project Summary

This repository provides a simplified PyTorch implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training) model. It's designed for researchers and developers interested in understanding and utilizing multimodal learning for tasks like image retrieval based on text queries. The project offers a clear, step-by-step guide to building and training a CLIP model from scratch.

How It Works

The implementation follows the core CLIP methodology: contrastive learning between image and text embeddings. It utilizes a ResNet50 (via timm) as the image encoder and DistilBERT (via HuggingFace transformers) as the text encoder. Both encoders project their outputs into a shared embedding space using separate projection heads. The training objective is a contrastive loss that aims to maximize the similarity between embeddings of corresponding image-text pairs while minimizing similarity for non-matching pairs.
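For concreteness, below is a minimal, self-contained sketch of that architecture in PyTorch: a timm ResNet50 image encoder, a HuggingFace DistilBERT text encoder, and a projection head for each modality. It is illustrative rather than the repo's exact code; the class names, embedding dimensions, and residual projection design are assumptions.

```python
# Illustrative sketch of the architecture described above (not the repo's exact code).
import timm
import torch
import torch.nn as nn
from transformers import DistilBertModel

class ProjectionHead(nn.Module):
    """Maps encoder features into the shared embedding space (assumed design)."""
    def __init__(self, in_dim: int, out_dim: int = 256, dropout: float = 0.1):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.fc = nn.Linear(out_dim, out_dim)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x):
        projected = self.proj(x)
        x = self.dropout(self.fc(self.gelu(projected)))
        return self.norm(x + projected)  # residual connection around the MLP

class CLIPModel(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits
        self.image_encoder = timm.create_model("resnet50", pretrained=True, num_classes=0)
        self.text_encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.image_projection = ProjectionHead(2048, embed_dim)  # ResNet50 feature dim
        self.text_projection = ProjectionHead(768, embed_dim)    # DistilBERT hidden dim

    def forward(self, images, input_ids, attention_mask):
        image_features = self.image_encoder(images)
        # use the [CLS] token's hidden state as the sentence representation
        text_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        text_features = text_out.last_hidden_state[:, 0, :]
        return self.image_projection(image_features), self.text_projection(text_features)
```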

Quick Start & Requirements

  • Install dependencies: pip install timm transformers
  • Requires PyTorch and a GPU for reasonable training times.
  • The provided main() function orchestrates training on the Flickr8k dataset (paths configurable in CFG; a configuration sketch follows this list).
  • Documentation and a demo are provided directly in the README, which serves as a step-by-step tutorial.
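As referenced above, here is a hypothetical sketch of what the CFG configuration object might look like; the field names and paths below are assumptions, not the repo's actual attributes.

```python
# Hypothetical configuration sketch: the repo exposes training knobs on a CFG
# object, but the field names and paths here are assumptions, not its exact fields.
import torch

class CFG:
    image_path = "./flickr8k/Images"           # assumed image directory
    captions_path = "./flickr8k/captions.txt"  # assumed captions file
    batch_size = 32
    epochs = 4
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The repo's main() (not shown here) reads fields like these to build dataloaders,
# train, and validate; edit the paths above to point at your Flickr8k copy.
print(f"training on {CFG.device} for {CFG.epochs} epochs")
```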

Highlighted Details

  • Implements CLIP from scratch in PyTorch.
  • Uses DistilBERT for efficient text encoding.
  • Custom contrastive loss function, detailed with explanations in the README (an illustrative sketch follows this list).
  • Includes functions for training, validation, and inference (image retrieval).
  • Model architecture includes projection heads to align image and text embeddings.
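To make the training objective and the retrieval step concrete, the sketch below shows a standard CLIP-style symmetric cross-entropy loss and a simple top-k retrieval helper. The repo's own loss is a custom variant explained in its README, and both function names here are hypothetical.

```python
# Illustrative only: a standard CLIP-style symmetric cross-entropy, not the
# repo's exact custom loss.
import torch
import torch.nn.functional as F

def clip_loss(image_embeddings, text_embeddings, temperature: float = 1.0):
    # similarity matrix of shape (batch, batch); matching pairs lie on the diagonal
    logits = text_embeddings @ image_embeddings.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_texts = F.cross_entropy(logits, targets)     # texts -> images direction
    loss_images = F.cross_entropy(logits.T, targets)  # images -> texts direction
    return (loss_texts + loss_images) / 2

def retrieve_images(text_embedding, image_embeddings, k: int = 9):
    # Hypothetical inference helper: rank precomputed image embeddings by
    # cosine similarity to an encoded text query and return the top-k indices.
    query = F.normalize(text_embedding, dim=-1)
    gallery = F.normalize(image_embeddings, dim=-1)
    return (query @ gallery.T).topk(k, dim=-1).indices
```

Because matching pairs occupy the diagonal of the similarity matrix, the targets are simply the row indices; the temperature controls how sharply the softmax concentrates on the best match.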

Maintenance & Community

The project is maintained by Moein Shariatnia. The README highlights several academic papers that have cited or used this code, indicating community adoption and validation.

Licensing & Compatibility

The repository does not explicitly state a license in the README. This is a critical omission for evaluating commercial use or integration into closed-source projects.

Limitations & Caveats

The absence of an explicit license, noted above, is the most significant limitation. The code is presented as a tutorial and may need adjustments for production use, particularly around dataset handling and error management. Training is time-consuming without a GPU.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu

482 stars
Multimodal model for grounding language models to images
Created 2 years ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago · Updated 1 year ago