clip-score by Taited

Calculate CLIP text-image similarity scores

Created 2 years ago
273 stars

Top 94.6% on SourcePulse

Project Summary

This repository provides efficient batch-wise calculation of CLIP scores, measuring text-image similarity using pre-trained CLIP models. It's designed for researchers and developers working with generative models or evaluating image-text alignment, offering a convenient way to quantify how well an image matches a textual description.

How It Works

The project loads CLIP models through Hugging Face's transformers library, defaulting to openai/clip-vit-base-patch32, and computes the cosine similarity between image embeddings and text embeddings. A recent update removed the previous 100x scaling factor: the tool now outputs raw cosine similarity values, which are bounded above by 1. This makes scores easier to interpret and keeps the tool aligned with the Hugging Face ecosystem.
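
The sketch below illustrates the underlying computation with the transformers API. It approximates what clip-score does rather than reproducing its source, and the file name and caption are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # the documented default
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("cat.png")        # placeholder image
caption = "a photo of a cat"         # placeholder caption

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# L2-normalize both embeddings; their dot product is then the cosine
# similarity. No 100x scaling is applied, so the score is bounded by 1.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
score = (img_emb * txt_emb).sum(dim=-1)
print(score.item())
```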

Quick Start & Requirements

  • Install via pip: pip install clip-score
  • Requires PyTorch.
  • Supports GPU acceleration automatically; specify device with --device cuda:N or --device cpu.
  • Data input requires paired image (.png, .jpg) and text (.txt) files in separate directories, matched by filename (see the layout sketch after this list).
  • Example usage: python -m clip_score path/to/images path/to/text, or python -m clip_score path/to/images "your prompt here" to score every image against a single prompt.
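
To make the pairing convention concrete, a hypothetical dataset might be laid out like this (names are placeholders):

```
data/
├── images/
│   ├── cat.png
│   └── dog.jpg
└── text/
    ├── cat.txt   # paired with cat.png by filename
    └── dog.txt   # paired with dog.jpg
```

Running python -m clip_score data/images data/text would then score each image against its same-named caption.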

Highlighted Details

  • Hugging Face integration for convenient model loading.
  • Flexible text input allows single sentence prompts for batch image evaluation.
  • Supports scoring within a single modality (image-image or text-text) by specifying --real_flag and --fake_flag; example invocations follow this list.
  • Outputs normalized cosine similarity values directly.
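
This summary does not spell out the values --real_flag and --fake_flag accept. The invocations below assume modality keywords img and txt, which should be verified against the project README:

```
# Assumed flag values (img / txt); check the README before relying on them.
python -m clip_score path/to/images_a path/to/images_b --real_flag img --fake_flag img  # image-image
python -m clip_score path/to/texts_a path/to/texts_b --real_flag txt --fake_flag txt    # text-text
```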

Maintenance & Community

The project was last updated in April 2025 with significant improvements. The primary contributor is Taited (SUN Zhengwentai). Community channels and a detailed roadmap are not described in the README.

Licensing & Compatibility

Licensed under the Apache License 2.0. This permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

Support for additional vision-language models such as DINO and BLIP is planned but not yet implemented. The data input convention also requires strict filename matching between the image and text directories; a simple pre-flight check is sketched below.
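
A short script can catch pairing mistakes before a run. This checker is a hypothetical convenience, not part of clip-score:

```python
from pathlib import Path

def check_pairs(img_dir: str, txt_dir: str) -> None:
    # Collect filename stems for supported image and text extensions.
    img_stems = {p.stem for p in Path(img_dir).iterdir() if p.suffix.lower() in {".png", ".jpg"}}
    txt_stems = {p.stem for p in Path(txt_dir).iterdir() if p.suffix.lower() == ".txt"}
    # Report anything that lacks a counterpart in the other directory.
    for stem in sorted(img_stems - txt_stems):
        print(f"image without caption: {stem}")
    for stem in sorted(txt_stems - img_stems):
        print(f"caption without image: {stem}")

check_pairs("path/to/images", "path/to/text")
```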

Health Check

  • Last commit: 5 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days

Explore Similar Projects

Starred by Dan Abramov (Core Contributor to React; Coauthor of Redux, Create React App), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 57 more.

stable-diffusion by CompVis

  • Latent text-to-image diffusion model
  • Top 0.1% on SourcePulse
  • 71k stars
  • Created 3 years ago; updated 1 year ago