bm25s by xhluca

Ultrafast Python BM25 implementation for lexical search

Created 2 years ago

1,730 stars

Top 23.8% on SourcePulse

6 Experts Love This Project

Georgios Konstantopoulos

CTO, General Partner at Paradigm

Bryan Helmig

Cofounder of Zapier

and 2 more!

Summary

BM25S (BM25-Sparse) is an ultrafast Python library for lexical search, implementing the BM25 ranking function. It is designed for engineers and researchers seeking high-performance text retrieval without heavy dependencies like Java or PyTorch. The library offers orders-of-magnitude speed improvements over existing Python implementations by leveraging sparse matrices for eager score computation and an optional Numba backend for further acceleration, making it suitable for large-scale text indexing and querying tasks.

How It Works

The core innovation is using NumPy and SciPy to build sparse matrices that store a pre-computed BM25 score for every token that occurs in each document. With this "eager sparse scoring" approach, most of the scoring work happens at index time, so a query only needs to look up and sum the stored scores for its tokens rather than recompute them. An optional Numba backend further improves performance by JIT-compiling the Python code, yielding approximately a 2x speedup on larger datasets. The design prioritizes speed and memory efficiency.
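
The idea of eager scoring can be illustrated with a minimal sketch. It is not the library's code: it swaps the SciPy sparse matrix for plain dictionaries so that it runs on the standard library alone, it assumes a Lucene-style BM25 formula, and the names build_eager_index and score_query are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_eager_index(corpus_tokens, k1=1.5, b=0.75):
    """Precompute a BM25 score for every (token, document) pair that occurs.

    Hypothetical helper: a dict of dicts stands in for a sparse matrix.
    """
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Number of documents containing each token.
    doc_freq = Counter(tok for doc in corpus_tokens for tok in set(doc))

    index = defaultdict(dict)  # token -> {doc_id: precomputed score}
    for doc_id, doc in enumerate(corpus_tokens):
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        for tok, tf in Counter(doc).items():
            # Lucene-style IDF (assumed here), which is always positive.
            df = doc_freq[tok]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            index[tok][doc_id] = idf * tf / (tf + norm)
    return index, n_docs


def score_query(index, n_docs, query_tokens):
    """Query time: only look up and sum the stored scores."""
    scores = [0.0] * n_docs
    for tok in query_tokens:
        for doc_id, stored in index.get(tok, {}).items():
            scores[doc_id] += stored
    return scores


corpus_tokens = [
    ["cat", "purr", "feline"],
    ["dog", "bark", "loyal"],
    ["fish", "swim", "water"],
]
index, n_docs = build_eager_index(corpus_tokens)
scores = score_query(index, n_docs, ["cat", "purr"])
best_doc = max(range(n_docs), key=scores.__getitem__)
```

Only pairs that actually occur are stored, which is what keeps the structure sparse; documents that share no token with the query are never touched at query time.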

Quick Start & Requirements

Install with pip: pip install bm25s. For enhanced functionality such as stemming, install the extras with pip install "bm25s[full]", or add the stemmer on its own with pip install PyStemmer. Optional JAX (pip install "jax[cpu]") can speed up top-k selection. Core dependencies are NumPy and SciPy. Links to a technical report and a blog post are mentioned but not provided. Basic and advanced usage examples are available in the repository.

Highlighted Details

Achieves "orders of magnitude faster lexical search via eager sparse scoring."
Numba backend offers ~2x speedup for larger datasets.
Supports Hugging Face integration for sharing and loading BM25 indices.
Includes a built-in Model Context Protocol (MCP) server for LLM agent integration.
Offers memory-efficient retrieval via memory-mapping (mmap=True).
Provides multiple BM25 variants (e.g., ATIRE, BM25L, BM25+).
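
To show how the listed variants differ, the sketch below computes the score contribution of a single term under ATIRE, BM25L, and BM25+. The formulas are the ones commonly given in comparative descriptions of BM25 variants; whether bm25s matches them term for term is an assumption, and the function names and delta defaults (0.5 for BM25L, 1.0 for BM25+) are chosen for illustration.

```python
import math


def length_norm(doc_len, avg_len, b):
    """Document-length normalization shared by all three variants."""
    return 1 - b + b * doc_len / avg_len


def atire(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75):
    # ATIRE: plain log(N / df) IDF with the classic saturating TF term.
    idf = math.log(n_docs / df)
    norm = length_norm(doc_len, avg_len, b)
    return idf * (k1 + 1) * tf / (k1 * norm + tf)


def bm25l(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75, delta=0.5):
    # BM25L: shifts the length-normalized TF by delta so that long
    # documents are penalized less.
    idf = math.log((n_docs + 1) / (df + 0.5))
    c = tf / length_norm(doc_len, avg_len, b)
    return idf * (k1 + 1) * (c + delta) / (k1 + c + delta)


def bm25_plus(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75, delta=1.0):
    # BM25+: adds delta to the TF term, giving any matching document
    # a lower bound on its contribution.
    idf = math.log((n_docs + 1) / df)
    norm = length_norm(doc_len, avg_len, b)
    return idf * ((k1 + 1) * tf / (k1 * norm + tf) + delta)


# One term occurring twice in an average-length document,
# present in 3 of 10 documents.
args = dict(tf=2, df=3, n_docs=10, doc_len=100, avg_len=100)
variant_scores = {
    "atire": atire(**args),
    "bm25l": bm25l(**args),
    "bm25+": bm25_plus(**args),
}
```

In the library itself the variant is selected when constructing the retriever rather than by calling separate functions.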

Maintenance & Community

The README does not detail specific contributors, sponsorships, or community channels (e.g., Discord, Slack).

Licensing & Compatibility

The primary license for the bm25s project is not explicitly stated in the README. A utility function is noted as being Apache 2.0 licensed, borrowed from the BEIR library. This lack of a clear project-wide license is a significant point for due diligence, especially concerning commercial use or integration into closed-source projects.

Limitations & Caveats

No explicit limitations or known bugs are detailed in the provided text. The project appears actively developed, with recent updates mentioning Numba support.

Health Check

Last Commit

4 days ago

Responsiveness

Inactive

Star History

27 stars in the last 30 days