LLM testing tool for evaluating in-context retrieval accuracy
This project provides a framework for pressure-testing the in-context retrieval capabilities of large language models (LLMs) across various context lengths and document depths. It's designed for researchers and developers evaluating LLM performance, offering a systematic way to measure accuracy when a specific piece of information (the "needle") is embedded within a large body of text (the "haystack").
How It Works
The core approach involves placing a specific fact or statement (the "needle") at a defined position within a large text document (the "haystack"). The LLM is then prompted to retrieve this needle. The system iterates through different context lengths and "document depths" (the percentage of the context window where the needle is placed) to quantify retrieval accuracy. This method allows for direct comparison of how well different LLMs handle long contexts and precise information recall.
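To make the mechanics concrete, here is a minimal sketch of that construction. The helper names (`insert_needle`, `build_prompt`) are illustrative only, not the package's API, and the sketch measures context length in characters rather than tokens for simplicity:

```python
# Illustrative sketch of the needle-in-a-haystack setup; names and the
# character-based length measure are simplifying assumptions, not the
# project's actual implementation.

def insert_needle(haystack: str, needle: str, depth_percent: float, context_length: int) -> str:
    """Trim the haystack to roughly `context_length` characters and place the
    needle `depth_percent` of the way through it."""
    trimmed = haystack[: max(context_length - len(needle), 0)]
    insert_at = int(len(trimmed) * depth_percent / 100)
    return trimmed[:insert_at] + " " + needle + " " + trimmed[insert_at:]

def build_prompt(context: str, retrieval_question: str) -> str:
    """Ask the model to answer using only the provided context."""
    return (
        "Here is a document:\n\n"
        f"{context}\n\n"
        f"Answer based only on the document above: {retrieval_question}"
    )

needle = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
haystack = "..."  # large body of filler text, e.g. concatenated essays
context = insert_needle(haystack, needle, depth_percent=50, context_length=2000)
prompt = build_prompt(context, "What is the best thing to do in San Francisco?")
# The prompt is sent to the LLM, and its response is scored against the needle
# to decide whether retrieval succeeded at this depth and context length.
```

Repeating this over a grid of context lengths and depth percentages yields an accuracy map showing where in the context window a given model starts to lose information.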
Quick Start & Requirements
- Install the package: `pip install needlehaystack`
- Set the required API keys as environment variables: `NIAH_MODEL_API_KEY` (for the model under test) and `NIAH_EVALUATOR_API_KEY` (for the evaluator).
- Run a test with the `needlehaystack.run_test` command, for example:
  `needlehaystack.run_test --provider openai --model_name "gpt-3.5-turbo-0125" --document_depth_percents "[50]" --context_lengths "[2000]"`
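To sweep several context lengths and needle depths, one option is a small wrapper that shells out to the CLI once per combination. This is an illustrative sketch that reuses only the flags from the example command above; it is not a documented feature of the package, and the specific values are arbitrary:

```python
# Hypothetical parameter sweep driving the CLI entry point shown above.
import subprocess

context_lengths = [1000, 2000, 4000]      # assumed example values
depth_percents = [0, 25, 50, 75, 100]     # assumed example values

for length in context_lengths:
    for depth in depth_percents:
        # Invoke the same command as the quick-start example, one
        # (context length, depth) pair at a time.
        subprocess.run(
            [
                "needlehaystack.run_test",
                "--provider", "openai",
                "--model_name", "gpt-3.5-turbo-0125",
                "--document_depth_percents", f"[{depth}]",
                "--context_lengths", f"[{length}]",
            ],
            check=True,
        )
```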
Highlighted Details
Maintenance & Community
The project appears to be actively maintained by gkamradt. Further community engagement details (e.g., Discord, Slack) are not explicitly mentioned in the README.
Licensing & Compatibility
Limitations & Caveats
The README notes that the script has been significantly upgraded since the original tests, so data formats may not match older results. It warns against saving contexts because the resulting files can be very large, and only text files are supported as the haystack source.