evoc by TutteInstitute

Fast clustering for embedding vectors

Created 2 years ago
301 stars

Top 88.3% on SourcePulse

Project Summary

Embedding Vector Oriented Clustering (EVōC) is a Python library for fast, flexible clustering of large sets of high-dimensional embedding vectors. It targets users working with embeddings from sources such as CLIP, sentence-transformers, OpenAI, and Cohere, and aims to be faster and to need less hyperparameter tuning than the common UMAP + HDBSCAN pipeline. It is designed to produce high-quality clusters efficiently, including on quantized vector formats.

How It Works

EVōC specializes in embedding vectors, which lets it skip the time-consuming parts of general-purpose clustering pipelines. It draws on techniques inspired by PLSCAN and density-based methods to cluster quickly on CPU. It can also generate multi-granularity clusters, a hierarchy of results from fine-grained to coarse-grained, and it natively supports int8 and binary quantized embeddings.
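To make "int8 or binary quantized embeddings" concrete, here is a minimal pure-Python sketch of two common quantization schemes. The function names are hypothetical. The scaling rule and the bit-packing order are assumptions chosen for illustration, not necessarily the formats EVōC expects, so check the project documentation before feeding quantized data to the library.

```python
def quantize_int8(vector):
    """Scale floats so the largest magnitude maps to 127, then round to ints."""
    max_abs = max(abs(x) for x in vector) or 1.0  # avoid dividing by zero
    return [round(x / max_abs * 127) for x in vector]


def quantize_binary(vector):
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    packed = bytearray((len(vector) + 7) // 8)
    for i, x in enumerate(vector):
        if x > 0:
            packed[i // 8] |= 1 << (7 - i % 8)  # most significant bit first
    return bytes(packed)


def hamming_distance(a, b):
    """Count differing bits between two equally long packed vectors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Illustrative 8-dimensional "embedding" (made-up values).
embedding = [0.12, -0.50, 0.33, 0.0, -0.07, 0.91, -0.25, 0.44]
int8_vec = quantize_int8(embedding)
bits = quantize_binary(embedding)
```

Binary vectors of this kind are usually compared with Hamming distance, which is much cheaper to compute than a floating-point cosine distance.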

Quick Start & Requirements

Install via pip: pip install evoc. Core dependencies include numpy, scikit-learn, numba, tqdm, and tbb. The README mentions no specialized hardware requirement such as a GPU. Full documentation is available at https://evoc.readthedocs.io/en/latest/.
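Once the package is installed, basic usage might look like the sketch below. It assumes the scikit-learn-style interface shown in the project README (an EVoC class with fit_predict and a cluster_layers_ attribute). Because the library is in early beta, these names may change, so treat this as an illustration and verify against the documentation. The wrapper function cluster_embeddings is hypothetical.

```python
def cluster_embeddings(data):
    """Cluster an (n_samples, n_dims) array of embedding vectors with EVoC.

    Assumes `pip install evoc` has been run. Class and attribute names
    follow the project README and may change while the library is in beta.
    """
    from evoc import EVoC  # imported lazily so this file loads without evoc

    clusterer = EVoC()
    labels = clusterer.fit_predict(data)  # one cluster label per input row
    layers = clusterer.cluster_layers_    # assumed: one label array per granularity
    return labels, layers
```

Calling cluster_embeddings(data) with a NumPy array of embeddings would return the default labels together with the per-granularity layers, assuming the interface above.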

Highlighted Details

  • Achieves fast clustering of embedding vectors on CPU.
  • Supports multi-granularity clustering with automatic cluster number selection and hierarchy extraction.
  • Natively handles clustering of int8 or binary quantized embedding vectors.
  • Includes automatic detection of duplicate or near-duplicate vectors.
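As an illustration of what near-duplicate detection means for embedding vectors, the sketch below flags pairs whose cosine similarity exceeds a threshold. This is a generic brute-force approach and is not EVōC's own detection method. The function names and the 0.999 threshold are assumptions chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def near_duplicate_pairs(vectors, threshold=0.999):
    """Return index pairs (i, j), i < j, whose similarity is >= threshold."""
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Made-up vectors: 0 and 2 are nearly identical, 1 and 3 differ only in scale.
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0001],
    [0.0, 2.0, 0.0],
]
pairs = near_duplicate_pairs(vectors)
```

The pairwise loop is quadratic in the number of vectors, so it only suits small inputs.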

Maintenance & Community

The project welcomes contributions via pull requests. The README does not mention community channels (e.g., Discord, Slack), a roadmap, notable contributors, or sponsorships.

Licensing & Compatibility

EVōC is released under the permissive BSD (2-clause) license, allowing for broad compatibility, including commercial use.

Limitations & Caveats

The library is explicitly described as an "early beta version," with a warning that "Things can and will break right now." Users should expect potential instability and are encouraged to provide feedback.

Health Check

Last Commit: 2 weeks ago
Responsiveness: 1 day
Pull Requests (30d): 1
Issues (30d): 0
Star History: 53 stars in the last 30 days
