clip-score by Taited

Calculate CLIP text-image similarity scores

Created 2 years ago
273 stars

Top 94.6% on SourcePulse

Project Summary

This repository provides efficient batch-wise calculation of CLIP scores, measuring text-image similarity using pre-trained CLIP models. It's designed for researchers and developers working with generative models or evaluating image-text alignment, offering a convenient way to quantify how well an image matches a textual description.

How It Works

The project loads CLIP models through Hugging Face's transformers library, defaulting to openai/clip-vit-base-patch32, and computes the cosine similarity between image embeddings and text embeddings. A recent update removed the previous 100x scaling factor: the tool now outputs raw cosine similarity values, which are bounded above by 1. This makes scores easier to interpret and keeps the tool aligned with the Hugging Face ecosystem.
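
The sketch below illustrates the underlying computation with the transformers API. It approximates what clip-score does rather than reproducing its source, and the file name and caption are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # the documented default
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("cat.png")        # placeholder image
caption = "a photo of a cat"         # placeholder caption

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# L2-normalize both embeddings; their dot product is then the cosine
# similarity. No 100x scaling is applied, so the score is bounded by 1.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
score = (img_emb * txt_emb).sum(dim=-1)
print(score.item())
```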

Quick Start & Requirements

  • Install via pip: pip install clip-score
  • Requires PyTorch.
  • Supports GPU acceleration automatically; specify device with --device cuda:N or --device cpu.
  • Data input requires paired image (.png, .jpg) and text (.txt) files in separate directories, matched by filename (see the layout sketch after this list).
  • Example usage: python -m clip_score path/to/images path/to/text, or python -m clip_score path/to/images "your prompt here" to score every image against a single prompt.
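
To make the pairing convention concrete, a hypothetical dataset might be laid out like this (names are placeholders):

```
data/
├── images/
│   ├── cat.png
│   └── dog.jpg
└── text/
    ├── cat.txt   # paired with cat.png by filename
    └── dog.txt   # paired with dog.jpg
```

Running python -m clip_score data/images data/text would then score each image against its same-named caption.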

Highlighted Details

  • Hugging Face integration for convenient model loading.
  • Flexible text input allows single sentence prompts for batch image evaluation.
  • Supports scoring within a single modality (image-image or text-text) by specifying --real_flag and --fake_flag; example invocations follow this list.
  • Outputs normalized cosine similarity values directly.
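
This summary does not spell out the values --real_flag and --fake_flag accept. The invocations below assume modality keywords img and txt, which should be verified against the project README:

```
# Assumed flag values (img / txt); check the README before relying on them.
python -m clip_score path/to/images_a path/to/images_b --real_flag img --fake_flag img  # image-image
python -m clip_score path/to/texts_a path/to/texts_b --real_flag txt --fake_flag txt    # text-text
```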

Maintenance & Community

The project was last updated in April 2025 with significant improvements. The primary contributor is Taited (SUN Zhengwentai). Community channels and a detailed roadmap are not described in the README.

Licensing & Compatibility

Licensed under the Apache License 2.0. This permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

Support for additional vision-language models such as DINO and BLIP is planned but not yet implemented. The data input convention also requires strict filename matching between the image and text directories; a simple pre-flight check is sketched below.
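
A short script can catch pairing mistakes before a run. This checker is a hypothetical convenience, not part of clip-score:

```python
from pathlib import Path

def check_pairs(img_dir: str, txt_dir: str) -> None:
    # Collect filename stems for supported image and text extensions.
    img_stems = {p.stem for p in Path(img_dir).iterdir() if p.suffix.lower() in {".png", ".jpg"}}
    txt_stems = {p.stem for p in Path(txt_dir).iterdir() if p.suffix.lower() == ".txt"}
    # Report anything that lacks a counterpart in the other directory.
    for stem in sorted(img_stems - txt_stems):
        print(f"image without caption: {stem}")
    for stem in sorted(txt_stems - img_stems):
        print(f"caption without image: {stem}")

check_pairs("path/to/images", "path/to/text")
```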

Health Check

  • Last commit: 5 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days

Explore Similar Projects

Starred by Dan Abramov (Core Contributor to React; Coauthor of Redux, Create React App), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 57 more.

stable-diffusion by CompVis

  • Latent text-to-image diffusion model
  • Top 0.1% on SourcePulse
  • 71k stars
  • Created 3 years ago; updated 1 year ago