pyserini  by castorini

Python toolkit for reproducible information retrieval research

Created 5 years ago
1,939 stars

Top 22.6% on SourcePulse

Project Summary

Pyserini is a Python toolkit for reproducible information retrieval (IR) research, enabling efficient first-stage retrieval using both sparse (e.g., BM25, uniCOIL, SPLADE) and dense (e.g., DPR, Contriever, BGE) representations. It targets researchers and practitioners in IR and NLP, offering prebuilt indexes, queries, relevance judgments, and evaluation scripts for numerous standard test collections, simplifying the reproduction of experimental runs.
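
To make the sparse side of this concrete, the following is a minimal, self-contained sketch of BM25 scoring over a toy in-memory collection. It does not use Pyserini or Lucene and is not their implementation: the function name `bm25_scores`, the whitespace tokenization, and the exact formula variant are assumptions chosen for illustration, and the `k1=0.9`, `b=0.4` values are illustrative defaults only.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each tokenized document against a tokenized query with BM25.

    Hypothetical helper for illustration; not Pyserini's or Lucene's code.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "sparse retrieval with bm25".split(),
    "dense retrieval with learned embeddings".split(),
    "cooking pasta at home".split(),
]
scores = bm25_scores("bm25 retrieval".split(), docs)
# Document indices ordered from best to worst match.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

A real first-stage retriever would score against an inverted index rather than looping over every document; the loop here only keeps the arithmetic visible.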

How It Works

Pyserini integrates with Anserini (Lucene-based) for sparse retrieval and Faiss for dense retrieval. This dual approach allows for flexible and powerful retrieval strategies, including hybrid sparse-dense fusion. The toolkit is designed for ease of use and reproducibility, providing a self-contained Python package with comprehensive documentation and pre-configured experimental setups for various corpora.
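
One generic way to combine a sparse and a dense ranking is reciprocal rank fusion (RRF), sketched below in plain Python. This is a stand-in technique for illustration, not a description of how Pyserini's own hybrid search combines scores; the function name `reciprocal_rank_fusion`, the constant `k=60`, and the document IDs are all assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by several retrievers rise to the top.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


# Hypothetical top-3 results from a sparse and a dense retriever.
sparse_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)
```

RRF uses only ranks, so it sidesteps the problem that sparse and dense scores live on different scales; score-interpolation approaches need a normalization or weighting step instead.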

Quick Start & Requirements

  • Install via PyPI: pip install pyserini
  • Requires Python 3.10+ and Java 21 (due to Anserini dependency).
  • Optional dependencies (e.g., faiss-cpu, lightgbm) can be installed with pip install 'pyserini[optional]'.
  • Detailed installation instructions and guides are available in the official documentation.
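
Before installing, it can help to confirm that the Python and Java requirements listed above are met. The script below is a small sketch using only the standard library; the helper names `parse_java_major` and `check_environment` are hypothetical, and treating any Java version of 21 or higher as acceptable is an assumption rather than something the project states.

```python
import re
import shutil
import subprocess
import sys


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: Java 8 reports itself as "1.8.0_...".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def check_environment():
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    java = shutil.which("java")
    if java is None:
        problems.append("java not found on PATH")
    else:
        # JDKs typically print the version banner to stderr.
        result = subprocess.run([java, "-version"], capture_output=True, text=True)
        major = parse_java_major(result.stderr + result.stdout)
        # Assumption: versions newer than 21 are also accepted.
        if major is None or major < 21:
            problems.append(f"Java 21 is required, found {major}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```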

Highlighted Details

  • Supports a wide range of sparse and dense retrieval models, including learned sparse models and hybrid approaches.
  • Offers "two-click reproductions" for numerous standard IR test collections, facilitating experimental reproducibility.
  • Provides prebuilt indexes for popular corpora like MS MARCO, BEIR, and MIRACL, with sizes up to 72 GB.
  • Underwent a transition from Lucene 8 to Lucene 9, enabling HNSW index capabilities.
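
Reproducing a run on a test collection ends with scoring ranked results against relevance judgments. As a rough illustration of that step, the sketch below computes mean reciprocal rank at a cutoff in plain Python. It is not one of Pyserini's evaluation scripts; the function names, the in-memory dictionaries standing in for run and judgment files, and the query and document IDs are all assumptions.

```python
def reciprocal_rank(ranked_doc_ids, relevant, cutoff=10):
    """Return 1/rank of the first relevant document within the cutoff, else 0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:cutoff], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(run, qrels, cutoff=10):
    """Average reciprocal rank over all judged queries.

    run:   query ID -> ranked list of document IDs
    qrels: query ID -> set of relevant document IDs
    Queries with judgments but no results contribute 0.
    """
    total = sum(
        reciprocal_rank(run.get(query_id, []), relevant, cutoff)
        for query_id, relevant in qrels.items()
    )
    return total / len(qrels)


# Hypothetical run and judgments for two queries.
run = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
qrels = {"q1": {"d1"}, "q2": {"d5"}}
mrr = mean_reciprocal_rank(run, qrels)
print(mrr)
```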

Maintenance & Community

  • Actively maintained with frequent releases (e.g., v0.44.0 in Jan 2025).
  • Primarily supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
  • References a SIGIR 2021 paper for an overview.

Licensing & Compatibility

  • The README does not explicitly state a license. Its dependencies (Lucene, Faiss, PyTorch, Transformers) carry different open-source licenses, so users should verify compatibility before using the toolkit in commercial or closed-source applications.

Limitations & Caveats

  • Installation of optional dependencies like Faiss can be temperamental.
  • Indexes built with Lucene 8 are not fully compatible with Lucene 9 code, though a workaround exists. Lucene 8 code cannot read Lucene 9 indexes.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull requests (30d): 26
  • Issues (30d): 5

Star History

  • 23 stars in the last 30 days
