codeneedle  by alexziskind1

LLM positional recall benchmark for long contexts

Created 1 month ago
263 stars

Top 96.8% on SourcePulse

Project Summary

CodeNeedle: Positional Recall Benchmark for LLMs

CodeNeedle is a benchmark for evaluating the positional recall accuracy of Large Language Models (LLMs) within extended contexts. It targets researchers and engineers who need to quantify how well an LLM can reproduce specific code segments verbatim from a large source corpus, which gives a quantitative measure that goes beyond simple retrieval.

How It Works

The benchmark "stuffs" a large source corpus into an LLM's context and prompts the model to reproduce, verbatim, the first N lines of designated functions. This assesses positional recall under long-context conditions, as opposed to basic entity lookup. Configuration lives in TOML files: corpus configs define the files and sampling, and model configs specify identifiers and parameters such as max_tokens and temperature, so comparisons are flexible and reproducible.
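The stuff-then-recall loop described above can be sketched roughly as follows. This is a minimal illustration, not CodeNeedle's actual implementation: `build_prompt`, `score_recall`, the prompt wording, and the line-by-line exact-match metric are all assumptions made for the example.

```python
def build_prompt(corpus_text: str, function_name: str, n_lines: int) -> str:
    """Place the whole corpus in the prompt, then ask for the first N lines.

    The wording here is hypothetical; the real benchmark's prompt may differ.
    """
    return (
        "Below is a source corpus.\n\n"
        f"{corpus_text}\n\n"
        f"Reproduce verbatim the first {n_lines} lines of the function "
        f"`{function_name}`. Output only the code."
    )


def score_recall(expected_lines: list[str], response: str) -> float:
    """Fraction of expected lines reproduced exactly at the same position."""
    # Strip only surrounding newlines so leading indentation is preserved.
    got = [line.rstrip() for line in response.strip("\n").splitlines()]
    hits = sum(
        1
        for i, want in enumerate(expected_lines)
        if i < len(got) and got[i] == want.rstrip()
    )
    return hits / len(expected_lines) if expected_lines else 0.0


# Tiny stand-in corpus; a real run would use tens of thousands of lines.
corpus = "def add(a, b):\n    total = a + b\n    return total\n"
expected = corpus.splitlines()[:2]

prompt = build_prompt(corpus, "add", n_lines=2)
perfect = score_recall(expected, "def add(a, b):\n    total = a + b\n")
partial = score_recall(expected, "def add(a, b):\n    return a + b\n")
```

Scoring by position, and not merely by whether the right function was found, is what separates this kind of test from entity lookup: a response that names the correct function but paraphrases its body still loses credit.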

Quick Start & Requirements

  • Installation: Uses uv for Python environment management (uv venv, then uv pip install -r requirements.txt). Scripts are run via uv run.
  • Running: Execute benchmarks with uv run python bench.py run --corpus <corpus_name> --model <model_name>.
  • Docker: An optional Docker setup provides an interactive environment via docker compose run --rm app.
  • Prerequisites: A Python environment. The inference server must load the model with a context window large enough to hold the corpus (e.g., LM Studio requires forcing the context to 128K). Detailed notes on context handling and KV caching are provided for specific servers (llama.cpp, LM Studio, Ollama).
  • Links: benchmark_plan.md, configs/CONFIG_README.md, analysis/VIZ_README.md.
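To compare several models on the same corpus, the run command listed above can be generated for each corpus/model pair. The sketch below only builds the argument lists; the corpus and model names are hypothetical placeholders, and real names would come from the project's TOML configs.

```python
import shlex


def bench_command(corpus: str, model: str) -> list[str]:
    """Build the documented `bench.py run` invocation as an argv list."""
    return [
        "uv", "run", "python", "bench.py", "run",
        "--corpus", corpus,
        "--model", model,
    ]


# Hypothetical names, used purely for illustration.
corpora = ["example_corpus"]
models = ["example_model_a", "example_model_b"]

commands = [bench_command(c, m) for c in corpora for m in models]

for argv in commands:
    # Print a copy-pasteable shell line for each planned run.
    print(shlex.join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` from the repository root to execute the runs one after another.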

Highlighted Details

  • Positional Recall Benchmark: Quantifies verbatim code reproduction from long contexts.
  • Configurable Setup: TOML files allow flexible definition of test corpora and LLM parameters.
  • Multi-file Handling: Supports concatenating multiple files with explicit path markers and deduplicating name collisions.
  • Visualization: Generates Plotly HTML dashboards for analyzing benchmark results.
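One way to picture the multi-file handling mentioned above is the sketch below. The marker format, the helper names, and the choice to drop function names defined in more than one file are assumptions for illustration; the project may resolve collisions differently, for example by qualifying names with their path.

```python
import re
from collections import Counter

# Matches top-level Python function definitions only (no leading indentation).
DEF_RE = re.compile(r"^def\s+(\w+)\s*\(", re.MULTILINE)


def concat_with_markers(files: dict[str, str]) -> str:
    """Join files into one corpus, prefixing each with an explicit path marker."""
    parts = [
        f"# ===== FILE: {path} =====\n{text.rstrip()}\n"
        for path, text in files.items()
    ]
    return "\n".join(parts)


def unique_targets(files: dict[str, str]) -> list[tuple[str, str]]:
    """Return (path, function) pairs whose name is defined in exactly one place."""
    counts = Counter(
        name for text in files.values() for name in DEF_RE.findall(text)
    )
    return [
        (path, name)
        for path, text in files.items()
        for name in DEF_RE.findall(text)
        if counts[name] == 1
    ]


files = {
    "pkg/a.py": "def load():\n    return 1\n\ndef parse():\n    return 2\n",
    "pkg/b.py": "def load():\n    return 3\n",
}

corpus = concat_with_markers(files)
targets = unique_targets(files)  # `load` is ambiguous, so only `parse` remains
```

Without some form of deduplication, a prompt asking for the first N lines of `load` would have two valid answers, which makes a verbatim score meaningless.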

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or roadmap are provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README, so it should be clarified before adoption.

Limitations & Caveats

LM Studio users may encounter issues with reported context sizes and automatic context unloading. Reasoning-capable models often ignore API toggles meant to disable chain-of-thought processing, necessitating larger max_tokens budgets. The benchmark focuses strictly on verbatim recall and positional accuracy, not broader LLM reasoning or generation capabilities.
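The max_tokens point can be made concrete with some rough arithmetic. The helper below is a hypothetical sketch: the tokens-per-line figure and the reasoning headroom are illustrative guesses, not values taken from the project, and should be tuned per model.

```python
def max_tokens_budget(
    n_lines: int,
    tokens_per_line: int = 16,       # assumed average for a line of code
    reasoning_headroom: int = 2048,  # assumed allowance for chain-of-thought
) -> int:
    """Estimate an output budget: the answer itself plus reasoning overhead."""
    answer_tokens = n_lines * tokens_per_line
    return answer_tokens + reasoning_headroom


# Budget for a model that emits only the answer.
plain = max_tokens_budget(20, reasoning_headroom=0)

# Budget for a model that reasons before answering, even when asked not to.
reasoning = max_tokens_budget(20)
```

If the budget covers only the answer, a model that spends tokens on reasoning first may be cut off before it emits any of the requested lines, which would register as a recall failure even though the model was simply truncated.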

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 2
  • Star History: 263 stars in the last 30 days
