similarities by shibing624

Toolkit for similarity calculation and semantic search

created 3 years ago
863 stars

Top 42.4% on sourcepulse

Project Summary

This toolkit provides efficient similarity calculation and semantic search for text and images, targeting developers and researchers working with large datasets. It offers a unified interface for various similarity metrics and search algorithms, enabling applications like text-to-text, text-to-image, and image-to-image retrieval with support for billions of data points.

How It Works

The toolkit leverages state-of-the-art models like CoSENT and CLIP for semantic understanding, converting text and images into dense vector embeddings. It supports multiple similarity calculation methods (Cosine, Dot Product, Hamming, Euclidean) and integrates efficient approximate nearest neighbor (ANN) search libraries such as Faiss, Annoy, and Hnswlib for high-throughput retrieval. Literal matching algorithms like BM25, TFIDF, and SimHash are also included for baseline comparisons or specific use cases.
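
The four similarity measures named above can be illustrated with a minimal standard-library sketch. This is not the toolkit's own implementation: the function names and toy vectors are assumptions made for illustration, and real embeddings would come from a model such as CoSENT or CLIP.

```python
import math

def dot(a, b):
    # Dot product: unnormalized similarity, sensitive to vector length.
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity: dot product of the length-normalized vectors.
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def euclidean(a, b):
    # Euclidean distance: smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    # Hamming distance: number of positions where two equal-length
    # binary codes differ (used with hashed or binarized vectors).
    return sum(x != y for x, y in zip(a, b))

# Toy vectors standing in for dense embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
doc = [1.0, 1.0, 0.0]

print("dot:", dot(query, doc))
print("cosine:", cosine(query, doc))
print("euclidean:", euclidean(query, doc))
print("hamming:", hamming([1, 0, 1, 1], [1, 1, 0, 1]))
```

Cosine and dot product are similarities (higher is closer), while Euclidean and Hamming are distances (lower is closer), so rankings built on them sort in opposite directions.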

Quick Start & Requirements

  • Install via pip: pip install -U similarities
  • PyTorch is a core dependency. GPU acceleration is recommended for performance.
  • Official demos available on Hugging Face Spaces: Image Search Demo, Text Search Demo
  • Extensive examples are provided within the repository.

Highlighted Details

  • Supports text-to-text, text-to-image, and image-to-image search.
  • Integrates semantic models (CoSENT, CLIP) and literal matching (BM25, SimHash).
  • Utilizes Faiss for efficient, billion-scale ANN search with GPU acceleration.
  • Offers a command-line interface (CLI) for embedding extraction, indexing, batch retrieval, and server deployment.
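
To make the literal-matching side concrete, the following is a minimal sketch of BM25 scoring over whitespace-tokenized documents. It is an illustrative stand-in, not the toolkit's code: the function name `bm25_scores`, the parameter defaults (`k1=1.5`, `b=0.75`), the smoothed IDF variant, and the toy corpus are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in a non-empty corpus against the query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

corpus = [
    "the cat sat on the mat".split(),
    "dogs chase the cat".split(),
    "stock markets fell sharply today".split(),
]
scores = bm25_scores("cat mat".split(), corpus)
ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranked, scores)
```

Unlike the embedding-based models, this scoring only rewards exact token overlap, which is why it serves as a baseline: the third document shares no query term and scores zero regardless of meaning.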

Maintenance & Community

The project is actively maintained by shibing624. Community engagement is encouraged via GitHub Issues.

Licensing & Compatibility

Licensed under the Apache License 2.0, permitting commercial use. A link to the project and license must be included in derivative products.

Limitations & Caveats

Although the toolkit supports billion-scale data, the speed and accuracy of ANN search depend on index configuration and hardware. Some examples are specific to Chinese CLIP models.
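
One common way to judge an index configuration is to compare ANN results against exact brute-force search on a small sample and measure recall. The sketch below shows that idea in plain Python; the helper names (`exact_top_k`, `recall_at_k`), the toy vectors, and the `approx` result list are hypothetical, and in practice the approximate IDs would come from a Faiss, Annoy, or Hnswlib index.

```python
import heapq
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def exact_top_k(query, vectors, k):
    # Brute-force search: score every vector, keep the k best indices.
    scored = ((cosine(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nlargest(k, scored)]

def recall_at_k(approx_ids, exact_ids):
    # Fraction of the true top-k that the approximate search recovered.
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact) if exact else 0.0

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.05]

truth = exact_top_k(query, vectors, k=2)
approx = [0, 2]  # hypothetical output of an ANN index for the same query
print(truth, recall_at_k(approx, truth))
```

Brute-force search is linear in the collection size, so it is only practical on a sample, but it gives the ground truth needed to see how much accuracy a faster index setting gives up.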

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 23 stars in the last 90 days
