Lucene toolkit for reproducible information retrieval research
Anserini is a comprehensive Java-based toolkit designed for reproducible information retrieval (IR) research, bridging the gap between academic IR and practical search applications. It provides researchers and engineers with a robust platform to build, evaluate, and reproduce IR experiments across various standard test collections.
How It Works
Anserini uses Apache Lucene as its core indexing and retrieval engine. It implements a range of retrieval models: traditional sparse methods such as BM25, learned sparse models (e.g., uniCOIL, SPLADE), and dense vector models (e.g., BGE, OpenAI Ada2, Cohere). For dense vectors, it supports efficient approximate nearest-neighbor search through HNSW (Hierarchical Navigable Small World) indexes. It also includes a built-in web application and REST API for interactive querying and integration.
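To make the sparse side of this concrete, here is a minimal, self-contained sketch of BM25 scoring over a toy in-memory corpus. It is not Anserini's or Lucene's code: the class name Bm25Sketch, the helper methods, the sample documents, and the k1 and b values are all assumptions chosen for illustration, and the formula is the textbook BM25 with a non-negative idf variant.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of Anserini or Lucene.
public class Bm25Sketch {

    // Non-negative idf variant: log(1 + (N - df + 0.5) / (df + 0.5)).
    static double idf(int numDocs, int docFreq) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Scores one tokenized document against a tokenized query.
    // Corpus statistics (document frequency, average length) are
    // recomputed on every call, which is fine only for a toy corpus.
    static double score(List<String> query, List<String> doc,
                        List<List<String>> corpus, double k1, double b) {
        double avgLen = corpus.stream().mapToInt(List::size).average().orElse(1.0);
        double total = 0.0;
        for (String term : query) {
            int tf = Collections.frequency(doc, term);
            if (tf == 0) {
                continue; // a term absent from the document contributes nothing
            }
            int df = 0;
            for (List<String> d : corpus) {
                if (d.contains(term)) {
                    df++;
                }
            }
            // Length normalization: longer-than-average documents are penalized.
            double norm = k1 * (1.0 - b + b * doc.size() / avgLen);
            total += idf(corpus.size(), df) * tf * (k1 + 1.0) / (tf + norm);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("lucene", "is", "a", "search", "library"),
            Arrays.asList("anserini", "builds", "on", "lucene"),
            Arrays.asList("dense", "vectors", "use", "hnsw", "indexes"));
        List<String> query = Arrays.asList("lucene", "search");
        // k1 and b are illustrative values, not a recommendation.
        for (int i = 0; i < corpus.size(); i++) {
            System.out.printf("doc %d: %.4f%n", i,
                score(query, corpus.get(i), corpus, 0.9, 0.4));
        }
    }
}
```

In this toy setup the first document matches both query terms and ranks highest, the second matches only "lucene", and the third scores zero. A real engine reads these statistics from an inverted index rather than rescanning the corpus.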
Quick Start & Requirements
Download the prebuilt fatjar:
wget https://repo1.maven.org/maven2/io/anserini/anserini/1.0.0/anserini-1.0.0-fatjar.jar
Or build from source with Maven:
mvn clean package
Highlighted Details
Evaluation tools: trec_eval, ndeval.
Maintenance & Community
The project is actively maintained by the castorini group, with frequent releases and a comprehensive history of updates. Contributions are encouraged, with clear guidelines for reproducing results and submitting pull requests.
Licensing & Compatibility
Anserini is released under the Apache License 2.0, which permits commercial use and integration with closed-source applications.
Limitations & Caveats