stark by snap-stanford

LLM retrieval benchmark on textual/relational knowledge bases (NeurIPS 2024)

created 1 year ago
315 stars

Top 86.9% on sourcepulse

View on GitHub
Project Summary

STaRK is a large-scale benchmark for evaluating retrieval systems, particularly those powered by Large Language Models (LLMs), across textual and semi-structured knowledge bases. It targets researchers and practitioners aiming to improve LLM-based information retrieval in domains like product search, academic discovery, and biomedicine, offering a standardized way to assess performance on complex, context-aware queries.

How It Works

STaRK pairs retrieval query datasets with semi-structured knowledge bases (SKBs). The benchmark supports a range of retrieval models, including BM25, ColBERTv2, embedding-based vector similarity search (VSS), and LLM rerankers (LLMReranker). Users load a dataset and its SKB, generate or download embeddings for queries and candidates, and run standardized evaluation scripts. The queries are designed to be practical and natural-sounding, requiring reasoning over diverse data sources.
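A minimal sketch of the loading step, assuming the load_qa and load_skb helpers from the stark_qa package described in the docs (the dataset name and indexing pattern are illustrative and may differ across versions):

    # Load a STaRK retrieval dataset and its semi-structured knowledge base (SKB).
    # Assumes the stark_qa loaders load_qa / load_skb from the project docs;
    # exact signatures may vary between releases.
    from stark_qa import load_qa, load_skb

    dataset_name = "amazon"  # other options include "mag" and "prime"

    qa_dataset = load_qa(dataset_name)                      # query/answer pairs
    skb = load_skb(dataset_name, download_processed=True)   # knowledge base

    # Inspect one query and its ground-truth answer node IDs.
    query, query_id, answer_ids, _ = qa_dataset[0]
    print(query)
    print(answer_ids[:5])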

Quick Start & Requirements

  • Install: pip install stark-qa
  • Prerequisites: Python >= 3.8 and < 3.12. For evaluation, the llm2vec, gritlm, and bm25 packages are needed. API keys for LLM providers (OpenAI, Anthropic, Voyage) may be required.
  • Data Loading: Datasets are automatically downloaded via Hugging Face. Processing raw data for STaRK-Amazon and STaRK-MAG can take up to an hour.
  • Resources: Embeddings can be downloaded or generated; generation may require significant compute depending on the embedding model (see the retrieval sketch after this list).
  • Docs: https://stark.stanford.edu/docs/index.html
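
As referenced in the Resources bullet above, the sketch below illustrates the idea behind the embedding-based VSS baseline: score a query embedding against precomputed candidate embeddings by cosine similarity and rank the candidates. It is a generic illustration rather than the library's implementation; the random tensors stand in for embeddings you would download or generate.

    # Vector similarity search (VSS) style retrieval, illustrated with placeholders.
    import torch
    import torch.nn.functional as F

    num_candidates, dim = 10_000, 1536                   # 1536 = text-embedding-ada-002 dimension
    candidate_emb = torch.randn(num_candidates, dim)     # placeholder node embeddings
    query_emb = torch.randn(1, dim)                       # placeholder query embedding

    # Cosine similarity between the query and every candidate, then rank.
    scores = F.cosine_similarity(query_emb, candidate_emb, dim=-1)
    topk_scores, topk_ids = scores.topk(k=10)

    # The top-k IDs form the retrieval result; comparing them with the
    # ground-truth answer IDs yields metrics such as Hit@k, Recall@k, or MRR.
    print(topk_ids.tolist())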

Highlighted Details

  • Benchmarks retrieval performance on textual and relational knowledge bases.
  • Includes diverse, human-generated query datasets for realistic evaluation.
  • Supports multiple embedding models (e.g., text-embedding-ada-002, GritLM/GritLM-7B).
  • Offers an interactive SKB Explorer and a Hugging Face leaderboard.

Licensing & Compatibility

  • License: MIT. Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The evaluation process requires careful management of embeddings and, for LLM-based reranking models, API keys. Processing raw data can be time-consuming (up to an hour for STaRK-Amazon and STaRK-MAG). The benchmark is relatively new, so its evaluation methodology may evolve as community contributions continue.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 0
  • Star History: 8 stars in the last 90 days
