stark by snap-stanford

LLM retrieval benchmark on textual/relational knowledge bases (NeurIPS 2024)

created 1 year ago
315 stars

Top 86.9% on sourcepulse

View on GitHub
Project Summary

STaRK is a large-scale benchmark for evaluating retrieval systems, particularly those powered by Large Language Models (LLMs), across textual and semi-structured knowledge bases. It targets researchers and practitioners aiming to improve LLM-based information retrieval in domains like product search, academic discovery, and biomedicine, offering a standardized way to assess performance on complex, context-aware queries.

How It Works

STaRK pairs retrieval query datasets with semi-structured knowledge bases (SKBs). The benchmark supports a range of retrieval models, including BM25, ColBERTv2, embedding-based vector similarity search (VSS), and LLM rerankers (LLMReranker). Users load a dataset and its SKB, generate or download embeddings for queries and candidates, and run standardized evaluation scripts. The queries are designed to be practical and natural-sounding, requiring reasoning over diverse data sources.
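A minimal sketch of the loading step, assuming the load_qa and load_skb helpers from the stark_qa package described in the docs (the dataset name and indexing pattern are illustrative and may differ across versions):

    # Load a STaRK retrieval dataset and its semi-structured knowledge base (SKB).
    # Assumes the stark_qa loaders load_qa / load_skb from the project docs;
    # exact signatures may vary between releases.
    from stark_qa import load_qa, load_skb

    dataset_name = "amazon"  # other options include "mag" and "prime"

    qa_dataset = load_qa(dataset_name)                      # query/answer pairs
    skb = load_skb(dataset_name, download_processed=True)   # knowledge base

    # Inspect one query and its ground-truth answer node IDs.
    query, query_id, answer_ids, _ = qa_dataset[0]
    print(query)
    print(answer_ids[:5])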

Quick Start & Requirements

  • Install: pip install stark-qa
  • Prerequisites: Python >= 3.8 and < 3.12. For evaluation, the llm2vec, gritlm, and bm25 packages are needed. API keys for LLM providers (OpenAI, Anthropic, Voyage) may be required.
  • Data Loading: Datasets are automatically downloaded via Hugging Face. Processing raw data for STaRK-Amazon and STaRK-MAG can take up to an hour.
  • Resources: Embeddings can be downloaded or generated; generation may require significant compute depending on the embedding model (see the retrieval sketch after this list).
  • Docs: https://stark.stanford.edu/docs/index.html
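
As referenced in the Resources bullet above, the sketch below illustrates the idea behind the embedding-based VSS baseline: score a query embedding against precomputed candidate embeddings by cosine similarity and rank the candidates. It is a generic illustration rather than the library's implementation; the random tensors stand in for embeddings you would download or generate.

    # Vector similarity search (VSS) style retrieval, illustrated with placeholders.
    import torch
    import torch.nn.functional as F

    num_candidates, dim = 10_000, 1536                   # 1536 = text-embedding-ada-002 dimension
    candidate_emb = torch.randn(num_candidates, dim)     # placeholder node embeddings
    query_emb = torch.randn(1, dim)                       # placeholder query embedding

    # Cosine similarity between the query and every candidate, then rank.
    scores = F.cosine_similarity(query_emb, candidate_emb, dim=-1)
    topk_scores, topk_ids = scores.topk(k=10)

    # The top-k IDs form the retrieval result; comparing them with the
    # ground-truth answer IDs yields metrics such as Hit@k, Recall@k, or MRR.
    print(topk_ids.tolist())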

Highlighted Details

  • Benchmarks retrieval performance on textual and relational knowledge bases.
  • Includes diverse, human-generated query datasets for realistic evaluation.
  • Supports multiple embedding models (e.g., text-embedding-ada-002, GritLM/GritLM-7B).
  • Offers an interactive SKB Explorer and a Hugging Face leaderboard.

Licensing & Compatibility

  • License: MIT. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The evaluation process requires careful management of embeddings and, for LLM-based reranking models, API keys. Processing raw data can be time-consuming (up to an hour for STaRK-Amazon and STaRK-MAG). The benchmark is relatively new, so its evaluation methodology may evolve as community contributions continue.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 0
  • Star History: 8 stars in the last 90 days
