LLM testing tool for evaluating in-context retrieval accuracy
This project provides a framework for pressure-testing the in-context retrieval capabilities of large language models (LLMs) across various context lengths and document depths. It's designed for researchers and developers evaluating LLM performance, offering a systematic way to measure accuracy when a specific piece of information (the "needle") is embedded within a large body of text (the "haystack").
How It Works
The core approach involves placing a specific fact or statement (the "needle") at a defined position within a large text document (the "haystack"). The LLM is then prompted to retrieve this needle. The system iterates through different context lengths and "document depths" (the percentage of the context window where the needle is placed) to quantify retrieval accuracy. This method allows for direct comparison of how well different LLMs handle long contexts and precise information recall.
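To make the mechanics concrete, here is a minimal sketch of that construction. The helper names (`insert_needle`, `build_prompt`) are illustrative only, not the package's API, and the sketch measures context length in characters rather than tokens for simplicity:

```python
# Illustrative sketch of the needle-in-a-haystack setup; names and the
# character-based length measure are simplifying assumptions, not the
# project's actual implementation.

def insert_needle(haystack: str, needle: str, depth_percent: float, context_length: int) -> str:
    """Trim the haystack to roughly `context_length` characters and place the
    needle `depth_percent` of the way through it."""
    trimmed = haystack[: max(context_length - len(needle), 0)]
    insert_at = int(len(trimmed) * depth_percent / 100)
    return trimmed[:insert_at] + " " + needle + " " + trimmed[insert_at:]

def build_prompt(context: str, retrieval_question: str) -> str:
    """Ask the model to answer using only the provided context."""
    return (
        "Here is a document:\n\n"
        f"{context}\n\n"
        f"Answer based only on the document above: {retrieval_question}"
    )

needle = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
haystack = "..."  # large body of filler text, e.g. concatenated essays
context = insert_needle(haystack, needle, depth_percent=50, context_length=2000)
prompt = build_prompt(context, "What is the best thing to do in San Francisco?")
# The prompt is sent to the LLM, and its response is scored against the needle
# to decide whether retrieval succeeded at this depth and context length.
```

Repeating this over a grid of context lengths and depth percentages yields an accuracy map showing where in the context window a given model starts to lose information.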
Quick Start & Requirements
- Install the package: `pip install needlehaystack`
- Set the required API keys as environment variables: `NIAH_MODEL_API_KEY` (for the model under test) and `NIAH_EVALUATOR_API_KEY` (for the evaluator).
- Run a test with the `needlehaystack.run_test` command, for example:
  `needlehaystack.run_test --provider openai --model_name "gpt-3.5-turbo-0125" --document_depth_percents "[50]" --context_lengths "[2000]"`
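To sweep several context lengths and needle depths, one option is a small wrapper that shells out to the CLI once per combination. This is an illustrative sketch that reuses only the flags from the example command above; it is not a documented feature of the package, and the specific values are arbitrary:

```python
# Hypothetical parameter sweep driving the CLI entry point shown above.
import subprocess

context_lengths = [1000, 2000, 4000]      # assumed example values
depth_percents = [0, 25, 50, 75, 100]     # assumed example values

for length in context_lengths:
    for depth in depth_percents:
        # Invoke the same command as the quick-start example, one
        # (context length, depth) pair at a time.
        subprocess.run(
            [
                "needlehaystack.run_test",
                "--provider", "openai",
                "--model_name", "gpt-3.5-turbo-0125",
                "--document_depth_percents", f"[{depth}]",
                "--context_lengths", f"[{length}]",
            ],
            check=True,
        )
```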
Highlighted Details
Maintenance & Community
The project appears to be actively maintained by gkamradt. Further community engagement details (e.g., Discord, Slack) are not explicitly mentioned in the README.
Licensing & Compatibility
Limitations & Caveats
The README notes that the script has been significantly upgraded since the original tests, so data formats may not match older results. It warns against saving contexts because the resulting files can be very large, and only text files are supported as the haystack source.