LLM retrieval benchmark on textual/relational knowledge bases (NeurIPS 2024)
STaRK is a large-scale benchmark for evaluating retrieval systems, particularly those powered by Large Language Models (LLMs), across textual and semi-structured knowledge bases. It targets researchers and practitioners aiming to improve LLM-based information retrieval in domains like product search, academic discovery, and biomedicine, offering a standardized way to assess performance on complex, context-aware queries.
How It Works
STaRK provides query-answer datasets for retrieval tasks together with semi-structured knowledge bases (SKBs). The benchmark supports a range of retrieval models, including BM25, ColBERTv2, vector similarity search (VSS) over LLM embeddings, and LLM-based reranking (LLMReranker). It facilitates evaluation by letting users load data, generate or download embeddings for queries and candidates, and run standardized evaluation scripts. The queries are designed to be practical and natural-sounding while requiring reasoning over diverse data sources.
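To make the flow concrete, here is a minimal illustrative sketch of the VSS-style evaluation step, not the repository's actual script: score candidates by cosine similarity between a query embedding and precomputed candidate embeddings, then check Hit@k. The embedding sizes and answer IDs are placeholders.

    # Illustrative only: VSS scoring plus a Hit@k check on placeholder data.
    import torch

    def hit_at_k(query_emb, candidate_embs, answer_ids, k=5):
        # Cosine similarity between one query and every candidate node.
        sims = torch.nn.functional.cosine_similarity(
            query_emb.unsqueeze(0), candidate_embs, dim=1
        )
        topk = sims.topk(k).indices.tolist()
        return any(a in topk for a in answer_ids)

    query_emb = torch.randn(768)             # stand-in for a real query embedding
    candidate_embs = torch.randn(1000, 768)  # stand-in for candidate embeddings
    print(hit_at_k(query_emb, candidate_embs, answer_ids=[42], k=5))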
Quick Start & Requirements
pip install stark-qa
Optional dependencies llm2vec, gritlm, and bm25 are needed for the corresponding retrieval models. API keys for LLM providers (OpenAI, Anthropic, Voyage) may be required for LLM-based retrievers and rerankers.
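A minimal loading sketch, following the loader names documented for the stark_qa package (load_qa and load_skb); the download_processed flag and the tuple layout are assumptions that may vary by version:

    # Load a benchmark dataset and its semi-structured knowledge base (SKB).
    from stark_qa import load_qa, load_skb

    dataset_name = "amazon"  # the benchmark also covers "mag" and "prime"

    qa_dataset = load_qa(dataset_name)                     # query-answer pairs
    skb = load_skb(dataset_name, download_processed=True)  # assumed flag: fetch processed SKB

    # Each item pairs a natural-language query with its ground-truth answer IDs.
    query, q_id, answer_ids, meta_info = qa_dataset[0]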
Highlighted Details
Precomputed query and candidate embeddings are available for selected embedding models (e.g., text-embedding-ada-002, GritLM/GritLM-7B).
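As a hypothetical illustration of consuming downloaded embeddings (the directory layout and file names below are assumptions, not guaranteed paths in the repository):

    # Hypothetical: load precomputed embeddings saved as torch files.
    import torch

    emb_dir = "emb/amazon/text-embedding-ada-002"  # assumed layout
    query_embs = torch.load(f"{emb_dir}/query/query_emb_dict.pt")        # assumed file name
    candidate_embs = torch.load(f"{emb_dir}/doc/candidate_emb_dict.pt")  # assumed file name
    print(len(query_embs), "query embeddings;", len(candidate_embs), "candidate embeddings")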
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The evaluation process requires careful management of embeddings and, for LLM-based rerankers, provider API keys. Processing raw data from scratch can be time-consuming, so the precomputed downloads are preferable where available. The benchmark is relatively new; its evaluation methodology may continue to evolve alongside community contributions.
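For example, a small pre-flight check for provider keys (the variable names follow common provider conventions and are assumptions, not names mandated by the repository):

    # Check for API keys before running LLM-based rerankers.
    import os

    required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "VOYAGE_API_KEY"]
    missing = [v for v in required if not os.environ.get(v)]
    if missing:
        print("Missing API keys:", ", ".join(missing))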