PyTorch CLIP implementation for text-image retrieval
This repository provides a simplified PyTorch implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training) model. It's designed for researchers and developers interested in understanding and utilizing multimodal learning for tasks like image retrieval based on text queries. The project offers a clear, step-by-step guide to building and training a CLIP model from scratch.
How It Works
The implementation follows the core CLIP methodology: contrastive learning between image and text embeddings. It utilizes a ResNet50 (via timm) as the image encoder and DistilBERT (via HuggingFace transformers) as the text encoder. Both encoders project their outputs into a shared embedding space using separate projection heads. The training objective is a contrastive loss that aims to maximize the similarity between embeddings of corresponding image-text pairs while minimizing similarity for non-matching pairs.
Quick Start & Requirements
pip install timm transformers
The main() function orchestrates training using the Flickr-8k dataset (paths configurable in CFG).
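After training, text-to-image retrieval reduces to a nearest-neighbour search in the shared embedding space. The sketch below is a hypothetical example (the image_embeddings gallery, tokenizer setup, and retrieve helper are assumptions, not part of the repository); it reuses the TextEncoder and ProjectionHead classes from the earlier sketch.

```python
import torch
import torch.nn.functional as F
from transformers import DistilBertTokenizer

# Hypothetical gallery: image embeddings produced by a trained image encoder +
# projection head, already L2-normalized, shape (N, 256). Placeholder data here.
image_embeddings = F.normalize(torch.randn(1000, 256), dim=-1)

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")


@torch.no_grad()
def retrieve(query, text_encoder, text_projection, k=5):
    """Return indices of the k gallery images most similar to the text query."""
    batch = tokenizer([query], padding=True, truncation=True, return_tensors="pt")
    text_feat = text_encoder(batch["input_ids"], batch["attention_mask"])
    text_emb = F.normalize(text_projection(text_feat), dim=-1)   # (1, 256)
    sims = text_emb @ image_embeddings.t()                       # cosine similarities
    return sims.squeeze(0).topk(k).indices


# Example call (trained weights would be loaded into the encoder and head in practice):
# top_idx = retrieve("a dog running on the beach", TextEncoder(), ProjectionHead(768))
```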
Highlighted Details
Maintenance & Community
The project is maintained by Moein Shariatnia. The README highlights several academic papers that have cited or used this code, indicating community adoption and validation.
Licensing & Compatibility
The repository does not explicitly state a license in the README. This is a critical omission for evaluating commercial use or integration into closed-source projects.
Limitations & Caveats
The lack of an explicit license is a significant limitation. The code is presented as a tutorial and may require adjustments for production environments, particularly regarding dataset handling and error management. Training can be time-consuming without a GPU.