LLMTest_NeedleInAHaystack by gkamradt

LLM testing tool for evaluating in-context retrieval accuracy

Created 1 year ago · 1,956 stars · Top 22.9% on sourcepulse

Project Summary

This project provides a framework for pressure-testing the in-context retrieval capabilities of large language models (LLMs) across various context lengths and document depths. It's designed for researchers and developers evaluating LLM performance, offering a systematic way to measure accuracy when a specific piece of information (the "needle") is embedded within a large body of text (the "haystack").

How It Works

The core approach involves placing a specific fact or statement (the "needle") at a defined position within a large text document (the "haystack"). The LLM is then prompted to retrieve this needle. The system iterates through different context lengths and "document depths" (the percentage of the context window where the needle is placed) to quantify retrieval accuracy. This method allows for direct comparison of how well different LLMs handle long contexts and precise information recall.
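
To make this concrete, below is a minimal Python sketch of the test loop. It is illustrative only, not the project's implementation: the insert_needle helper, the sample needle, and the character-based trimming (the real tool works in tokens) are all assumptions.

    # Illustrative needle-in-a-haystack loop; not the project's internal code.
    needle = "The secret code is 4417."      # hypothetical fact to retrieve
    question = "What is the secret code?"

    def insert_needle(haystack, needle, depth_percent, context_length):
        # Trim the haystack to the target size (characters here; the real
        # tool counts tokens) and splice the needle in at depth_percent.
        trimmed = haystack[:context_length]
        position = int(len(trimmed) * depth_percent / 100)
        return trimmed[:position] + needle + trimmed[position:]

    haystack = open("haystack.txt").read()   # hypothetical source text
    for context_length in (1000, 2000, 4000):
        for depth_percent in (0, 25, 50, 75, 100):
            context = insert_needle(haystack, needle, depth_percent, context_length)
            # Prompt the model with `context` plus `question`, then score
            # whether the answer recovers the needle's fact.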

Quick Start & Requirements

  • Install via pip: pip install needlehaystack
  • Requires API keys for the supported model providers (OpenAI, Anthropic, Cohere), set as the environment variables NIAH_MODEL_API_KEY and NIAH_EVALUATOR_API_KEY.
  • Run tests through the command-line entry point needlehaystack.run_test; a combined example follows this list.
  • Example: needlehaystack.run_test --provider openai --model_name "gpt-3.5-turbo-0125" --document_depth_percents "[50]" --context_lengths "[2000]"
  • Official documentation and examples are available in the repository.
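
Putting those steps together, a typical session might look like this (the key values are placeholders):

    pip install needlehaystack
    export NIAH_MODEL_API_KEY="sk-..."        # key for the model under test
    export NIAH_EVALUATOR_API_KEY="sk-..."    # key for the evaluator model
    needlehaystack.run_test --provider openai --model_name "gpt-3.5-turbo-0125" \
        --document_depth_percents "[50]" --context_lengths "[2000]"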

Highlighted Details

  • Supports OpenAI, Anthropic, and Cohere model providers.
  • Offers both model-based and LangSmith-based evaluation strategies.
  • Includes functionality for multi-needle insertion and distribution (sketched after this list).
  • Provides visualization scripts for analyzing results.
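
The multi-needle mode raises the question of where each needle should land. The sketch below shows one even-spacing scheme in Python; it is an assumption about the general idea, not the library's actual distribution algorithm.

    # Spread N needles at evenly spaced depths (3 needles land at roughly
    # 25%, 50%, and 75%). Illustrative only; not the library's code.
    def distribute_needles(haystack, needles):
        result = haystack
        n = len(needles)
        # Insert back-to-front so earlier offsets stay valid after splicing.
        for i in range(n, 0, -1):
            position = int(len(haystack) * i / (n + 1))
            result = result[:position] + " " + needles[i - 1] + " " + result[position:]
        return result

For example, distribute_needles(text, ["fact one", "fact two", "fact three"]) places the three facts at roughly 25%, 50%, and 75% depth.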

Maintenance & Community

The project is maintained by gkamradt; recent activity is summarized in the Health Check below. Community channels (e.g., Discord, Slack) are not mentioned in the README.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive for commercial use and closed-source linking; the copyright and license notice must be retained.

Limitations & Caveats

The README notes that the script has been significantly upgraded since the original tests were run, so data formats may not match older results. Saving full contexts to disk is discouraged because the files can become very large. Only text files are supported as haystack material.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 120 stars in the last 90 days

Explore Similar Projects

yarn by jquesnelle
Context window extension method for LLMs (research paper, models)
Created 2 years ago · updated 1 year ago · 2k stars · Top 1.0% on sourcepulse
Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 1 more.