anserini by castorini

Lucene toolkit for reproducible information retrieval research

Created 10 years ago

1,098 stars

Top 34.5% on SourcePulse

Project Summary

Anserini is a Java toolkit for reproducible information retrieval (IR) research, built to bridge the gap between academic IR and practical search applications. Researchers and engineers can use it to build, evaluate, and reproduce IR experiments across standard test collections.

How It Works

Anserini uses Apache Lucene as its core indexing and retrieval engine. It implements three families of retrieval models: traditional sparse methods such as BM25, learned sparse models (e.g., uniCOIL, SPLADE), and dense vector models (e.g., BGE, OpenAI Ada2, Cohere). For dense vectors, the toolkit supports efficient approximate nearest-neighbor search through HNSW (Hierarchical Navigable Small World) indexes. It also includes a built-in web application and REST API for interactive querying and integration.
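
To make the sparse-retrieval idea concrete, here is a minimal sketch of BM25 scoring in plain Python. It is not Anserini's or Lucene's implementation: the function name bm25_scores, the toy documents, and the parameter values k1=0.9 and b=0.4 are assumptions chosen for illustration, and the formula is the common textbook form with a non-negative idf.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each tokenized document against a tokenized query (illustrative only)."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative idf variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "lucene is a search library".split(),
    "anserini builds on lucene for retrieval research".split(),
    "dense vectors use hnsw indexes".split(),
]
query = "lucene retrieval".split()
scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The second document matches both query terms and ranks first; the third shares no terms with the query and scores zero. A real engine computes the same kind of score from an inverted index rather than by scanning every document.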

Quick Start & Requirements

Install: Download the fatjar: wget https://repo1.maven.org/maven2/io/anserini/anserini/1.0.0/anserini-1.0.0-fatjar.jar
Prerequisites: Java 21, Maven 3.9+ (for building from source).
Setup: The fatjar provides a quick start; building from source requires cloning the repository with submodules and running mvn clean package.
Documentation: Anserini Onboarding

Highlighted Details

Supports sparse retrieval (BM25, SPLADE variants) and dense retrieval over HNSW or flat indexes.
Provides end-to-end regression experiments for numerous standard test collections (MS MARCO, BEIR, TREC, etc.).
Includes tools for evaluating retrieval results (e.g., trec_eval, ndeval).
Offers prebuilt indexes for many corpora, which are automatically downloaded upon request.
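
As a rough illustration of what evaluation tools in the trec_eval family compute, the sketch below derives the reciprocal rank for one query from a run file and a relevance-judgment (qrels) file in the standard TREC text formats. It is not trec_eval's code: the function name reciprocal_rank and the sample lines are hypothetical, and ranking documents by the score column is a choice made here for simplicity.

```python
def reciprocal_rank(run_lines, qrels_lines, qid):
    """Return 1/position of the first relevant document for qid, or 0.0 if none."""
    # qrels line format: "qid iteration docid relevance"
    relevant = set()
    for line in qrels_lines:
        q, _, docid, rel = line.split()
        if q == qid and int(rel) > 0:
            relevant.add(docid)

    # Run line format: "qid Q0 docid rank score tag"
    scored = []
    for line in run_lines:
        q, _, docid, _rank, score, _tag = line.split()
        if q == qid:
            scored.append((float(score), docid))

    # Rank by descending score, then find the first relevant document.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    for position, (_, docid) in enumerate(scored, start=1):
        if docid in relevant:
            return 1.0 / position
    return 0.0

# Hypothetical sample data for a single query.
run = [
    "q1 Q0 d3 1 9.1 demo",
    "q1 Q0 d1 2 7.4 demo",
    "q1 Q0 d2 3 5.0 demo",
]
qrels = [
    "q1 0 d1 1",
    "q1 0 d2 0",
]
rr = reciprocal_rank(run, qrels, "q1")
```

Here the first relevant document (d1) appears at position 2, so the reciprocal rank is 0.5. Averaging this value over all queries gives mean reciprocal rank (MRR).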

Maintenance & Community

The castorini group actively maintains the project and publishes frequent releases. Contributions are encouraged, and the project provides guidelines for reproducing results and submitting pull requests.

Licensing & Compatibility

Anserini is released under the Apache License 2.0, which permits commercial use and integration with closed-source applications.

Limitations & Caveats

Requires Java 21, which may rule out environments pinned to older Java versions.
Prebuilt indexes can be very large, requiring significant disk space.
Windows users are strongly advised to use WSL2 due to potential encoding issues.

Health Check

Last Commit

1 day ago

Responsiveness

1 day

Star History

4 stars in the last 30 days