alexziskind1: LLM positional recall benchmark for long contexts
CodeNeedle: Positional Recall Benchmark for LLMs
CodeNeedle is a benchmark for evaluating the positional recall accuracy of Large Language Models (LLMs) within extended contexts. It targets researchers and engineers who need to quantify how well an LLM can reproduce specific code segments verbatim from a large source corpus, offering a quantitative measure beyond simple retrieval.
How It Works
The benchmark "stuffs" a large source corpus into an LLM's context and prompts it to reproduce verbatim the first N lines of designated functions. This methodology specifically assesses positional recall under long-context conditions, differentiating it from basic entity lookup. Configuration is managed via TOML files for corpora (defining files, sampling) and models (specifying identifiers, parameters like max_tokens, temperature), enabling flexible and reproducible comparisons.
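The prompt-and-score loop described above can be sketched roughly as follows. This is a minimal illustration, not CodeNeedle's actual implementation: the function names (`build_prompt`, `score_recall`), the prompt wording, and the choice of metrics (exact line matches plus a `difflib` similarity ratio) are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def build_prompt(corpus_files: dict[str, str], function_name: str, n: int) -> str:
    """Concatenate the whole corpus, then ask for the first n lines of one function.

    Hypothetical prompt layout; the real benchmark's wording may differ.
    """
    stuffed = "\n\n".join(
        f"# FILE: {path}\n{source}" for path, source in corpus_files.items()
    )
    question = (
        f"Reproduce verbatim the first {n} lines of the function "
        f"`{function_name}` from the code above. Output only the code."
    )
    return f"{stuffed}\n\n{question}"


def first_n_lines(text: str, n: int) -> list[str]:
    """Return the first n lines with trailing whitespace stripped."""
    return [line.rstrip() for line in text.splitlines()[:n]]


def score_recall(expected_source: str, model_output: str, n: int) -> dict:
    """Compare the model's output against the true first n lines.

    Assumed metrics: count of exactly matching lines (by position),
    a strict all-lines-match flag, and a character-level similarity ratio.
    """
    expected = first_n_lines(expected_source, n)
    produced = first_n_lines(model_output, n)
    exact = sum(1 for e, p in zip(expected, produced) if e == p)
    similarity = SequenceMatcher(
        None, "\n".join(expected), "\n".join(produced)
    ).ratio()
    return {
        "exact_lines": exact,
        "total_lines": len(expected),
        "all_match": expected == produced,
        "similarity": similarity,
    }
```

Scoring line by line at fixed positions, rather than checking whether the function was merely located, is what separates positional recall from entity lookup: a model that finds the right function but paraphrases its third line still loses credit for that line.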
Quick Start & Requirements
Setup: uv for Python environment management (uv venv, uv pip install -r requirements.txt). Scripts are run via uv run.
Run: uv run python bench.py run --corpus <corpus_name> --model <model_name>
Docker: docker compose run --rm app for an interactive environment.
Docs: benchmark_plan.md, configs/CONFIG_README.md, analysis/VIZ_README.md.
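To compare several models across several corpora, the run command above can be wrapped in a small driver script. The sketch below only assembles the documented `uv run python bench.py run --corpus ... --model ...` invocation; the helper names (`build_command`, `sweep`) and the corpus and model names in the usage line are hypothetical placeholders, and the script defaults to a dry run that prints commands instead of executing them.

```python
import itertools
import subprocess


def build_command(corpus: str, model: str) -> list[str]:
    """Assemble the documented bench.py invocation for one corpus/model pair."""
    return [
        "uv", "run", "python", "bench.py", "run",
        "--corpus", corpus,
        "--model", model,
    ]


def sweep(corpora: list[str], models: list[str], dry_run: bool = True) -> list[list[str]]:
    """Build (and optionally execute) one command per corpus/model combination."""
    commands = [
        build_command(corpus, model)
        for corpus, model in itertools.product(corpora, models)
    ]
    for command in commands:
        if dry_run:
            # Print only; nothing is executed in dry-run mode.
            print(" ".join(command))
        else:
            # check=True stops the sweep at the first failing run.
            subprocess.run(command, check=True)
    return commands
```

For example, `sweep(["corpus_a"], ["model_x", "model_y"])` prints two commands; passing `dry_run=False` would run them in sequence from the repository root, assuming the named corpus and model configs exist.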
Maintenance & Community
No specific details regarding maintainers, community channels (e.g., Discord, Slack), or roadmap are provided in the README.
Licensing & Compatibility
The README does not explicitly state the repository's license, so the licensing terms should be confirmed before adoption.
Limitations & Caveats
LM Studio users may encounter issues with reported context sizes and automatic context unloading. Reasoning-capable models often ignore API toggles meant to disable chain-of-thought processing, necessitating larger max_tokens budgets. The benchmark focuses strictly on verbatim recall and positional accuracy, not broader LLM reasoning or generation capabilities.