anserini  by castorini

Lucene toolkit for reproducible information retrieval research

created 9 years ago
1,065 stars

Top 36.0% on sourcepulse

Project Summary

Anserini is a Java toolkit for reproducible information retrieval (IR) research, built to bridge the gap between academic IR and practical search applications. Researchers and engineers can use it to build, evaluate, and reproduce IR experiments on standard test collections.

How It Works

Anserini uses Apache Lucene as its core indexing and retrieval engine. It implements traditional sparse retrieval models such as BM25, learned sparse models (e.g., uniCOIL, SPLADE), and dense retrieval over vectors from embedding models (e.g., BGE, OpenAI Ada2, Cohere). For dense vectors, it supports approximate nearest-neighbor search through HNSW (Hierarchical Navigable Small World) indexes. It also includes a built-in web application and REST API for interactive querying and integration.
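
To make the sparse retrieval side concrete, here is a minimal sketch of BM25 scoring in plain Python. It is not Anserini's or Lucene's code: the function name bm25_scores, the toy documents, and the k1 and b values are illustrative assumptions, and the formula is one common textbook variant that may differ in detail from what Lucene computes.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each tokenized document against a tokenized query with BM25.

    k1 and b are illustrative values (assumptions), not confirmed defaults.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "lucene is a search library".split(),
    "anserini builds on lucene for retrieval research".split(),
    "dense vectors use nearest neighbor search".split(),
]
scores = bm25_scores("lucene retrieval".split(), docs)
# Document indexes ordered from highest to lowest score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The second document matches both query terms and ranks first; the third matches neither and scores zero. A real Lucene index computes comparable statistics from its inverted index rather than rescanning every document.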

Quick Start & Requirements

  • Install: Download the fatjar: wget https://repo1.maven.org/maven2/io/anserini/anserini/1.0.0/anserini-1.0.0-fatjar.jar
  • Prerequisites: Java 21, Maven 3.9+ (for building from source).
  • Setup: The fatjar provides a quick start; building from source requires cloning the repository with submodules and running mvn clean package.
  • Documentation: Anserini Onboarding

Highlighted Details

  • Supports sparse retrieval (BM25, SPLADE variants) and dense retrieval over HNSW or flat indexes.
  • Provides end-to-end regression experiments for numerous standard test collections (MS MARCO, BEIR, TREC, etc.).
  • Includes tools for evaluating retrieval results (e.g., trec_eval, ndeval).
  • Offers prebuilt indexes for many corpora, which are automatically downloaded upon request.
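
As an illustration of what evaluation tools such as trec_eval compute, the sketch below parses the standard TREC qrels format (qid, iteration, docid, relevance) and run format (qid, Q0, docid, rank, score, tag) and computes mean reciprocal rank at a cutoff. The function names and sample lines are hypothetical, and this is a simplified stand-in for the real tools: it averages over every judged query, which is a choice made here for brevity.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC qrels lines: 'qid iteration docid relevance'."""
    qrels = defaultdict(dict)
    for line in lines:
        qid, _, docid, rel = line.split()
        qrels[qid][docid] = int(rel)
    return qrels


def parse_run(lines):
    """Parse TREC run lines: 'qid Q0 docid rank score tag'."""
    hits_by_query = defaultdict(list)
    for line in lines:
        qid, _, docid, _rank, score, _tag = line.split()
        hits_by_query[qid].append((float(score), docid))
    # Order each query's hits by descending score (the rank field is ignored).
    return {
        qid: [docid for _, docid in sorted(hits, key=lambda h: h[0], reverse=True)]
        for qid, hits in hits_by_query.items()
    }


def mean_reciprocal_rank(qrels, run, k=10):
    """Average of 1/position of the first relevant hit within the top k."""
    total = 0.0
    for qid, judged in qrels.items():
        for pos, docid in enumerate(run.get(qid, [])[:k], start=1):
            if judged.get(docid, 0) > 0:
                total += 1.0 / pos
                break
    return total / len(qrels)


# Hypothetical sample data in the two formats.
qrels = parse_qrels(["q1 0 d2 1", "q1 0 d3 0", "q2 0 d5 1"])
run = parse_run([
    "q1 Q0 d1 1 12.3 demo",
    "q1 Q0 d2 2 11.0 demo",
    "q2 Q0 d5 1 9.5 demo",
])
mrr = mean_reciprocal_rank(qrels, run)
print(mrr)  # 0.75: first relevant hit at position 2 for q1, position 1 for q2
```

For reported results, the official tools remain the reference, since they define tie-breaking and query averaging precisely.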

Maintenance & Community

The project is actively maintained by the castorini group, with frequent releases and a comprehensive history of updates. Contributions are encouraged, with clear guidelines for reproducing results and submitting pull requests.

Licensing & Compatibility

Anserini is released under the Apache License 2.0, which permits commercial use and integration with closed-source applications.

Limitations & Caveats

  • Requires Java 21, which may be a constraint for some environments.
  • Prebuilt indexes can be very large, requiring significant disk space.
  • Windows users are strongly advised to use WSL2 due to potential encoding issues.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 33
  • Issues (30d): 5
  • Star history: 14 stars in the last 90 days
