gkamradt: LLM testing tool for evaluating in-context retrieval accuracy
This project provides a framework for pressure-testing the in-context retrieval capabilities of large language models (LLMs) across various context lengths and document depths. It's designed for researchers and developers evaluating LLM performance, offering a systematic way to measure accuracy when a specific piece of information (the "needle") is embedded within a large body of text (the "haystack").
How It Works
The core approach involves placing a specific fact or statement (the "needle") at a defined position within a large text document (the "haystack"). The LLM is then prompted to retrieve this needle. The system iterates through different context lengths and "document depths" (the percentage of the context window where the needle is placed) to quantify retrieval accuracy. This method allows for direct comparison of how well different LLMs handle long contexts and precise information recall.
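To make the mechanics concrete, here is a minimal sketch of that sweep, assuming character-based lengths (the real project counts tokens) and a crude substring check in place of the project's LLM-based evaluator. All names here (insert_needle, run_sweep, the needle text) are illustrative, not the package's API.

```python
# Sketch of a needle-in-a-haystack sweep: splice a known fact into a long
# document at a given depth, ask the model to retrieve it, and record
# pass/fail for each (context length, depth) pair.

NEEDLE = "The special magic number hidden in this document is 7421."
QUESTION = "What is the special magic number hidden in this document?"

def insert_needle(haystack: str, needle: str, depth_percent: float, context_length: int) -> str:
    # Reserve room for the needle, then splice it in at the requested depth.
    trimmed = haystack[: max(0, context_length - len(needle))]
    split_at = int(len(trimmed) * depth_percent / 100)
    return trimmed[:split_at] + needle + trimmed[split_at:]

def run_sweep(haystack, context_lengths, depth_percents, ask):
    # `ask` is any callable that sends a prompt to an LLM and returns text.
    results = []
    for length in context_lengths:
        for depth in depth_percents:
            context = insert_needle(haystack, NEEDLE, depth, length)
            answer = ask(f"{context}\n\nQuestion: {QUESTION}\nAnswer:")
            results.append((length, depth, "7421" in answer))  # crude pass/fail
    return results

if __name__ == "__main__":
    haystack = "Lorem ipsum dolor sit amet. " * 2000
    dummy_ask = lambda prompt: "7421"  # stand-in; wire a real model client here
    for row in run_sweep(haystack, [1000, 2000], [0, 50, 100], dummy_ask):
        print(row)
```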
Quick Start & Requirements
- Install the package: pip install needlehaystack
- Set the required API keys as environment variables (NIAH_MODEL_API_KEY, NIAH_EVALUATOR_API_KEY).
- Run a test with the needlehaystack.run_test command, for example (a multi-point sweep sketch follows this list):
  needlehaystack.run_test --provider openai --model_name "gpt-3.5-turbo-0125" --document_depth_percents "[50]" --context_lengths "[2000]"
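To sweep several depths and context lengths in one go, a small wrapper can shell out to the CLI shown above. The flag names are taken from the example command; the multi-element bracketed values and everything else here are assumptions, so treat this as a sketch rather than documented usage.

```python
import subprocess

def fmt(values):
    # Mirror the bracketed list syntax from the example command ("[50]").
    return "[" + ",".join(map(str, values)) + "]"

depths = [0, 25, 50, 75, 100]   # where the needle sits, % into the context
lengths = [1000, 2000, 4000]    # context window sizes to probe

# NIAH_MODEL_API_KEY and NIAH_EVALUATOR_API_KEY must already be exported.
subprocess.run(
    [
        "needlehaystack.run_test",
        "--provider", "openai",
        "--model_name", "gpt-3.5-turbo-0125",
        "--document_depth_percents", fmt(depths),
        "--context_lengths", fmt(lengths),
    ],
    check=True,
)
```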
Highlighted Details
Maintenance & Community
The project appears to be actively maintained by gkamradt. Further community engagement details (e.g., Discord, Slack) are not explicitly mentioned in the README.
Licensing & Compatibility
Limitations & Caveats
The README notes that the script has been significantly upgraded since the original tests were run, so data formats may not match older results. It also warns against saving contexts, since the resulting files can become very large. Only text files are supported for the haystack.
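Because only plain-text haystacks are supported, one workaround is to flatten your source material into text yourself. A hypothetical helper (build_haystack is not part of the package) might look like this:

```python
from pathlib import Path

def build_haystack(folder: str) -> str:
    # Concatenate every .txt file in the folder into one plain-text corpus.
    parts = [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("*.txt"))]
    return "\n\n".join(parts)
```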