similarities by shibing624

Toolkit for similarity calculation and semantic search

Created 3 years ago

891 stars

Top 40.6% on SourcePulse

Project Summary

This toolkit provides efficient similarity calculation and semantic search for text and images, targeting developers and researchers working with large datasets. It offers a unified interface for various similarity metrics and search algorithms, enabling applications like text-to-text, text-to-image, and image-to-image retrieval with support for billions of data points.

How It Works

The toolkit uses semantic models such as CoSENT (text) and CLIP (text and images) to convert inputs into dense vector embeddings. It supports multiple similarity calculation methods (Cosine, Dot Product, Hamming, Euclidean) and integrates approximate nearest neighbor (ANN) search libraries such as Faiss, Annoy, and Hnswlib for high-throughput retrieval. Literal matching algorithms like BM25, TF-IDF, and SimHash are also included for baseline comparisons or specific use cases.
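To make the four similarity measures named above concrete, here is a minimal standard-library sketch. It is not the toolkit's implementation: the function names and the toy vectors are assumptions chosen for illustration, and real embeddings would have hundreds of dimensions.

```python
import math


def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product of the vectors after length normalization."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Hamming distance: number of positions where two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy "embeddings" (hypothetical values, not produced by CoSENT or CLIP).
query_vec = [1.0, 0.0, 1.0]
doc_vec = [1.0, 1.0, 0.0]

# Toy binary codes, e.g. the kind of fingerprint a hashing method might produce.
query_code = [1, 0, 1, 1]
doc_code = [1, 1, 0, 1]

print("dot:", dot(query_vec, doc_vec))
print("cosine:", cosine(query_vec, doc_vec))
print("euclidean:", euclidean(query_vec, doc_vec))
print("hamming:", hamming(query_code, doc_code))
```

Cosine and dot product are similarities (higher is closer), while Euclidean and Hamming are distances (lower is closer), so rankings built on them sort in opposite directions.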

Quick Start & Requirements

Install via pip: pip install -U similarities
PyTorch is a core dependency. GPU acceleration is recommended for performance.
Official demos available on Hugging Face Spaces: Image Search Demo, Text Search Demo
Extensive examples are provided within the repository.

Highlighted Details

Supports text-to-text, text-to-image, and image-to-image search.
Integrates semantic models (CoSENT, CLIP) and literal matching (BM25, SimHash).
Utilizes Faiss for efficient, billion-scale ANN search with GPU acceleration.
Offers a command-line interface (CLI) for embedding extraction, indexing, batch retrieval, and server deployment.
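As an illustration of the literal-matching side, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It is an assumption-laden example rather than the toolkit's own code: the function name `bm25_scores`, the whitespace tokenization, the parameter defaults (`k1=1.5`, `b=0.75`), and the non-negative IDF variant are all choices made here for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in corpus_tokens against query_tokens with BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Non-negative IDF variant (adds 1 inside the logarithm).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            tf = term_freq[term]
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical toy corpus, tokenized by whitespace.
corpus = [
    "how to change the bank card".split(),
    "the weather is nice today".split(),
    "change the card linked to my account".split(),
]
query = "change bank card".split()

scores = bm25_scores(query, corpus)
ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranked, scores)
```

Because BM25 only counts overlapping tokens, a paraphrase with no shared words scores zero, which is why it serves as a baseline next to the embedding-based models.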

Maintenance & Community

The project is actively maintained by shibing624. Community engagement is encouraged via GitHub Issues.

Licensing & Compatibility

Licensed under the Apache License 2.0, permitting commercial use. A link to the project and license must be included in derivative products.

Limitations & Caveats

The toolkit supports billion-scale data, but the speed and accuracy of ANN search depend on index configuration and hardware. Some examples are specific to Chinese CLIP models.
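One common way to judge an index configuration is to compare its results against exact brute-force search on a small sample and compute recall@k. The sketch below assumes nothing about the toolkit's API: `exact_top_k`, `recall_at_k`, the toy vectors, and the `approx_ids` list standing in for an ANN library's output are all hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def exact_top_k(query, corpus, k):
    """Brute-force search: score every vector and keep the k most similar ids."""
    scored = ((cosine(query, vec), idx) for idx, vec in enumerate(corpus))
    return [idx for _, idx in heapq.nlargest(k, scored)]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate result also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Hypothetical 2-D vectors; real embeddings are much higher-dimensional.
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.0]

exact_ids = exact_top_k(query, corpus, k=2)

# Stand-in for ids returned by an ANN index (made up for this example).
approx_ids = [0, 2]

print("exact:", exact_ids)
print("recall@2:", recall_at_k(approx_ids, exact_ids))
```

Brute-force search is linear in corpus size, so it is only practical on a sample, but it gives the ground truth needed to see how much accuracy a faster index setting gives up.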

Explore Similar Projects

clip-score by Taited

awesome-document-similarity by malteos

CLIPPyX by 0ssamaak0

awesome-semantic-search by Agrover112

vector-search-class-notes by edoliberty

vectordb by kagisearch

similarity-search-kit by ZachNagengast

swiss_army_llama by Dicklesworthstone

natural-language-image-search by haltakov

text_similarity by adsieg

clip-retrieval by rom1504

typesense by typesense