pyserini by castorini

Python toolkit for reproducible information retrieval research

created 5 years ago
1,896 stars

Top 23.5% on sourcepulse

Project Summary

Pyserini is a Python toolkit for reproducible information retrieval (IR) research, enabling efficient first-stage retrieval using both sparse (e.g., BM25, uniCOIL, SPLADE) and dense (e.g., DPR, Contriever, BGE) representations. It targets researchers and practitioners in IR and NLP, offering prebuilt indexes, queries, relevance judgments, and evaluation scripts for numerous standard test collections, simplifying the reproduction of experimental runs.

How It Works

Pyserini integrates with Anserini (Lucene-based) for sparse retrieval and Faiss for dense retrieval. This dual approach allows for flexible and powerful retrieval strategies, including hybrid sparse-dense fusion. The toolkit is designed for ease of use and reproducibility, providing a self-contained Python package with comprehensive documentation and pre-configured experimental setups for various corpora.
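To make the idea of hybrid sparse-dense fusion concrete, here is a minimal sketch that interpolates two score lists after min-max normalization. It is an illustration only, not Pyserini's own fusion code: the function names, the normalization choice, and the alpha weight are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores identical: treat every document as equally relevant.
        return {docid: 1.0 for docid in scores}
    return {docid: (s - lo) / (hi - lo) for docid, s in scores.items()}


def fuse(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Weighted interpolation of sparse and dense scores (hypothetical helper).

    alpha weights the sparse side; documents missing from one list
    contribute 0 from that list.
    """
    sparse_n = min_max_normalize(sparse)
    dense_n = min_max_normalize(dense)
    fused = {
        docid: alpha * sparse_n.get(docid, 0.0)
        + (1.0 - alpha) * dense_n.get(docid, 0.0)
        for docid in set(sparse_n) | set(dense_n)
    }
    # Sort by descending score, breaking ties by docid for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Made-up scores for illustration: BM25-like on the left, cosine-like on the right.
sparse_scores = {"d1": 12.0, "d2": 8.0, "d3": 4.0}
dense_scores = {"d2": 0.9, "d3": 0.7, "d4": 0.5}
ranking = fuse(sparse_scores, dense_scores)
```

A document that scores moderately well in both lists (d2 here) can outrank one that appears in only a single list, which is the usual motivation for fusing the two signals.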

Quick Start & Requirements

  • Install via PyPI: pip install pyserini
  • Requires Python 3.10+ and Java 21 (due to Anserini dependency).
  • Optional dependencies (e.g., faiss-cpu, lightgbm) can be installed with pip install 'pyserini[optional]'.
  • Detailed installation instructions and guides are available: Official Documentation
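Before running pip install, it can help to confirm that the Python and Java requirements listed above are met. The following is a rough sketch using only the standard library; the helper names are hypothetical, and the version parsing assumes the usual `java -version` output format.

```python
import re
import shutil
import subprocess
import sys
from typing import List, Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major version from `java -version` output.

    Handles both the modern scheme ('version "21.0.2"') and the
    legacy one where Java 8 reports itself as '1.8.0_xxx'.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2):
        return int(match.group(2))
    return major


def check_prerequisites(min_python=(3, 10), min_java: int = 21) -> List[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info[0]}.{sys.version_info[1]}"
        )
    java = shutil.which("java")
    if java is None:
        problems.append("java not found on PATH")
        return problems
    try:
        proc = subprocess.run(
            [java, "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError) as exc:
        problems.append(f"could not run java: {exc}")
        return problems
    # `java -version` conventionally prints to stderr, so check both streams.
    major = parse_java_major(proc.stderr + proc.stdout)
    if major is None:
        problems.append("could not determine Java version")
    elif major < min_java:
        problems.append(f"Java {min_java}+ required, found {major}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```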

Highlighted Details

  • Supports a wide range of sparse and dense retrieval models, including learned sparse models and hybrid approaches.
  • Offers "two-click reproductions" for numerous standard IR test collections, facilitating experimental reproducibility.
  • Provides prebuilt indexes for popular corpora like MS MARCO, BEIR, and MIRACL, with sizes up to 72 GB.
  • Migrated from Lucene 8 to Lucene 9, which enables HNSW indexes.
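Reproducing a run typically ends with a ranked list written in the standard TREC run format (query ID, the literal Q0, document ID, rank, score, run tag), which evaluation tools then score against relevance judgments. Below is a minimal sketch of that formatting step. The helper name and the sample document IDs and scores are made up for illustration; the only assumption about the hit objects is that they expose docid and score attributes, as the hits returned by Pyserini's Lucene searcher do.

```python
from collections import namedtuple
from typing import Iterable, List

# Stand-in for a search hit; any object with .docid and .score works.
Hit = namedtuple("Hit", ["docid", "score"])


def to_trec_run(qid: str, hits: Iterable, tag: str = "my-run") -> List[str]:
    """Format ranked hits as TREC run lines: 'qid Q0 docid rank score tag'.

    Hits are assumed to arrive already sorted by descending score,
    so the rank is simply the 1-based position in the list.
    """
    lines = []
    for rank, hit in enumerate(hits, start=1):
        lines.append(f"{qid} Q0 {hit.docid} {rank} {hit.score:.6f} {tag}")
    return lines


# Illustrative values only, not real retrieval output.
sample_hits = [Hit("7187158", 11.0083), Hit("7187157", 10.5)]
run_lines = to_trec_run("q1", sample_hits)
```

With Pyserini installed, the hits returned by a searcher's search call could be passed to the helper in place of the sample list.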

Maintenance & Community

  • Actively maintained with frequent releases (e.g., v0.44.0 in Jan 2025).
  • Primarily supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
  • The README points to a SIGIR 2021 paper for an overview of the toolkit.

Licensing & Compatibility

  • The README does not explicitly state a license, so check the repository's license file for the governing terms. The dependencies (Lucene, Faiss, PyTorch, Transformers) each carry their own open-source licenses. Verify compatibility before using Pyserini in commercial or closed-source applications.

Limitations & Caveats

  • Installation of optional dependencies like Faiss can be temperamental.
  • Indexes built with Lucene 8 are not fully compatible with Lucene 9 code, though a workaround exists. Lucene 8 code cannot read Lucene 9 indexes.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 37
  • Issues (30d): 12
  • Star history: 87 stars in the last 90 days
