bm25s  by xhluca

Ultrafast Python BM25 implementation for lexical search

Created 1 year ago
1,495 stars

Top 27.2% on SourcePulse

Project Summary

BM25S (BM25-Sparse) is an ultrafast Python library for lexical search, implementing the BM25 ranking function. It is designed for engineers and researchers seeking high-performance text retrieval without heavy dependencies like Java or PyTorch. The library offers orders-of-magnitude speed improvements over existing Python implementations by leveraging sparse matrices for eager score computation and an optional Numba backend for further acceleration, making it suitable for large-scale text indexing and querying tasks.

How It Works

The core innovation lies in using NumPy and SciPy to build sparse matrices that store precomputed BM25 scores for the tokens that occur in each document. This "eager sparse scoring" approach shifts most of the work to indexing time: at query time, the stored scores for the query tokens only need to be looked up and summed, which drastically reduces computation compared to traditional implementations that score documents on the fly. An optional Numba backend further improves performance by JIT-compiling Python code, yielding approximately a 2x speedup on larger datasets. This design prioritizes speed and memory efficiency.
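The idea of eager scoring can be illustrated with a minimal sketch. This is not the library's code: plain Python dictionaries stand in for the SciPy sparse matrix, the function names (build_index, score_query, top_k) are hypothetical, and the scoring formula and the k1 and b defaults are one common BM25 formulation chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(corpus_tokens, k1=1.5, b=0.75):
    """Precompute a BM25 score for every (term, document) pair that occurs.

    Assumption: dict-of-dicts stands in for a sparse matrix; pairs that are
    absent are implicit zeros, exactly as in a sparse representation.
    """
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    index = defaultdict(dict)  # index[term][doc_id] = precomputed score
    for doc_id, doc in enumerate(corpus_tokens):
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        for term, tf in Counter(doc).items():
            df = doc_freq[term]
            # Non-negative IDF (illustrative choice, not necessarily the default).
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            index[term][doc_id] = idf * tf * (k1 + 1) / (tf + length_norm)
    return index, n_docs


def score_query(index, n_docs, query_tokens):
    """Query time is only lookups and additions over the stored scores."""
    scores = [0.0] * n_docs
    for term in query_tokens:
        for doc_id, stored in index.get(term, {}).items():
            scores[doc_id] += stored
    return scores


def top_k(scores, k):
    """Return the ids of the k highest-scoring documents."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


corpus_tokens = [
    ["a", "cat", "purrs"],
    ["a", "dog", "barks"],
    ["a", "bird", "sings"],
]
index, n_docs = build_index(corpus_tokens)
scores = score_query(index, n_docs, ["cat"])
best = top_k(scores, k=1)
```

All term statistics (term frequency, document frequency, length normalization) are consumed during build_index, so score_query never touches the raw documents. That is the property the sparse-matrix design exploits.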

Quick Start & Requirements

Install via pip: pip install bm25s. For optional features such as stemming, use pip install "bm25s[full]", or install the stemmer on its own with pip install PyStemmer. JAX (pip install "jax[cpu]") is optional and can speed up top-k selection. The core dependencies are NumPy and SciPy. The README mentions a technical report and a blog post, but the links are not provided here. Example usage and advanced examples are available in the repository.

Highlighted Details

  • Achieves "orders of magnitude faster lexical search via eager sparse scoring."
  • Numba backend offers ~2x speedup for larger datasets.
  • Supports Hugging Face integration for sharing and loading BM25 indices.
  • Includes a built-in Model Context Protocol (MCP) server for LLM agent integration.
  • Offers memory-efficient retrieval via memory-mapping (mmap=True).
  • Provides multiple BM25 variants (e.g., ATIRE, BM25L, BM25+).
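
To make the difference between the named variants concrete, the sketch below writes out the IDF and term-frequency components of ATIRE, BM25L, and BM25+ as they are commonly given in the information-retrieval literature. The function names and the default values for k1, b, and delta are illustrative assumptions; the library's own definitions and defaults may differ.

```python
import math


def idf(variant, n_docs, df):
    """Inverse document frequency for a term found in df of n_docs documents."""
    if variant == "atire":
        return math.log(n_docs / df)
    if variant == "bm25l":
        return math.log((n_docs + 1) / (df + 0.5))
    if variant == "bm25+":
        return math.log((n_docs + 1) / df)
    raise ValueError(f"unknown variant: {variant}")


def tf_component(variant, tf, doc_len, avg_len, k1=1.5, b=0.75, delta=0.5):
    """Term-frequency component; k1, b, and delta defaults are illustrative."""
    length_norm = 1 - b + b * doc_len / avg_len
    if variant == "atire":
        return (k1 + 1) * tf / (k1 * length_norm + tf)
    if variant == "bm25l":
        # BM25L shifts the length-normalized frequency by delta.
        c = tf / length_norm
        return (k1 + 1) * (c + delta) / (k1 + c + delta)
    if variant == "bm25+":
        # BM25+ adds delta as a lower bound after the usual saturation.
        return (k1 + 1) * tf / (k1 * length_norm + tf) + delta
    raise ValueError(f"unknown variant: {variant}")


def term_score(variant, tf, df, n_docs, doc_len, avg_len, **params):
    """Score contribution of one term in one document."""
    return idf(variant, n_docs, df) * tf_component(
        variant, tf, doc_len, avg_len, **params
    )
```

BM25L and BM25+ both aim to reduce the penalty that plain length normalization places on long documents, which is why they introduce the extra delta parameter.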

Maintenance & Community

The README does not detail specific contributors, sponsorships, or community channels (e.g., Discord, Slack).

Licensing & Compatibility

The primary license for the bm25s project is not explicitly stated in the README. A utility function is noted as being Apache 2.0 licensed, borrowed from the BEIR library. This lack of a clear project-wide license is a significant point for due diligence, especially concerning commercial use or integration into closed-source projects.

Limitations & Caveats

No explicit limitations or known bugs are detailed in the provided text. The project appears actively developed, with recent updates mentioning Numba support.

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 4
Issues (30d): 0

Star History

31 stars in the last 30 days
