ann-benchmarks by erikbern

ANN benchmarks for approximate nearest neighbor search algorithms

Created 10 years ago

5,569 stars

Top 9.0% on SourcePulse

Project Summary

This project provides a comprehensive benchmarking framework for approximate nearest neighbor (ANN) search libraries, targeting researchers and engineers working with high-dimensional data. It offers standardized datasets, Dockerized environments for each algorithm, and tools for reproducible evaluation, enabling objective comparison of ANN library performance.

How It Works

The framework uses pre-generated HDF5 datasets that include ground truth for the top-100 nearest neighbors of each query. Each ANN library runs inside its own Docker container, so every algorithm is measured in a consistent execution environment. Python scripts orchestrate indexing, querying, and result collection. The setup aims to keep a single CPU saturated during each run and to make parameter tuning reproducible.
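
To make the role of the ground truth concrete, here is a minimal sketch of how an approximate result can be scored against exact neighbors. It is not the project's own code: the function names `brute_force_neighbors` and `recall_at_k` are hypothetical, the data is synthetic rather than loaded from an HDF5 file, and the id-overlap recall shown here is a simplification that may differ from the metric the framework actually reports.

```python
import numpy as np


def brute_force_neighbors(train, queries, k):
    """Exact top-k neighbor indices by Euclidean distance (the ground truth)."""
    # Squared distances via ||q - t||^2 = ||q||^2 - 2 q.t + ||t||^2
    d2 = (
        (queries ** 2).sum(axis=1)[:, None]
        - 2.0 * queries @ train.T
        + (train ** 2).sum(axis=1)[None, :]
    )
    return np.argsort(d2, axis=1)[:, :k]


def recall_at_k(approx_ids, true_ids):
    """Mean fraction of true neighbors that the approximate result recovered."""
    hits = [len(set(a) & set(t)) for a, t in zip(approx_ids, true_ids)]
    return sum(hits) / (len(true_ids) * true_ids.shape[1])


# Synthetic stand-in for a dataset: 1000 indexed vectors, 50 queries, 16 dims.
rng = np.random.default_rng(0)
train = rng.standard_normal((1000, 16))
queries = rng.standard_normal((50, 16))

truth = brute_force_neighbors(train, queries, k=10)

# Hypothetical "approximate" search: only look at the first half of the index.
approx = brute_force_neighbors(train[:500], queries, k=10)

print("recall@10:", recall_at_k(approx, truth))
```

Scoring the exact result against itself gives a recall of 1.0, which is a quick sanity check before plugging in a real library's output.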

Quick Start & Requirements

Install: pip install -r requirements.txt followed by python install.py.
Prerequisites: Python (3.10.6 tested), Docker.
Setup Time: install.py can take 10-30 minutes. Running benchmarks (run.py) can take days.
Links: ann-benchmarks.com

Highlighted Details

Benchmarks over 40 ANN libraries including FAISS, NMSLIB, ScaNN, and Elasticsearch.
Supports various datasets (SIFT, GloVe, MNIST, etc.) with dimensions from 25 to 27,983.
Results are presented as plots and can be used to generate a website.
Includes a reproducibility protocol and related publications.
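
Benchmark plots of this kind usually trade accuracy against speed, with one point per parameter setting. The sketch below assumes each run is summarized as a `(recall, queries_per_second)` pair and keeps only the runs that no other run beats on both axes. The function name `pareto_frontier` and the sample numbers are hypothetical and are not taken from the project's results or plotting code.

```python
def pareto_frontier(points):
    """Return the (recall, qps) points not dominated by any other point.

    A point is dominated when another point has recall and qps that are
    both at least as high, and at least one of them strictly higher.
    """
    # Visit points from highest recall down; ties go to the faster run first.
    ordered = sorted(points, key=lambda p: (-p[0], -p[1]))
    frontier = []
    best_qps = float("-inf")
    for recall, qps in ordered:
        # Keep a point only if it is faster than every higher-recall point.
        if qps > best_qps:
            frontier.append((recall, qps))
            best_qps = qps
    return frontier[::-1]  # ascending recall, ready for plotting


# Made-up results for one library under six parameter settings.
runs = [
    (0.50, 9000),
    (0.80, 5000),
    (0.75, 4000),
    (0.95, 1200),
    (0.95, 900),
    (0.99, 300),
]

for recall, qps in pareto_frontier(runs):
    print(f"recall={recall:.2f}  qps={qps}")
```

In this made-up data, the runs at `(0.75, 4000)` and `(0.95, 900)` are dropped because another setting is at least as accurate and faster.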

Maintenance & Community

Authors: Erik Bernhardsson, Martin Aumüller, Alexander Faithfull.
Open to pull requests for improvements and new library integrations.

Licensing & Compatibility

License: Not explicitly stated in the README.
Compatibility: Primarily CPU-based algorithms; GPU support is mentioned for FAISS but requires local compilation. Datasets fit in RAM.

Limitations & Caveats

The project focuses on CPU-based ANN algorithms and datasets that fit in RAM; billion-scale benchmarks are handled by a separate project. GPU support for libraries like FAISS requires local compilation and specific flags. The results published in the README are dated April 2025, so they may not reflect newer library versions.

Health Check

Last Commit: 7 months ago
Responsiveness: 1 day

Star History

35 stars in the last 30 days