codeneedle by alexziskind1

LLM positional recall benchmark for long contexts

Created 2 months ago

301 stars

Top 88.4% on SourcePulse

Project Summary

CodeNeedle: Positional Recall Benchmark for LLMs

CodeNeedle is a benchmark for measuring the positional recall accuracy of Large Language Models (LLMs) over long contexts. It targets researchers and engineers who need to quantify how well an LLM can reproduce specific code segments verbatim from a large source corpus, which is a stricter test than simple retrieval.

How It Works

The benchmark "stuffs" a large source corpus into an LLM's context and prompts it to reproduce verbatim the first N lines of designated functions. Because the model must return the exact text found at a specific location, not merely confirm that a name exists, the test measures positional recall under long-context conditions rather than basic entity lookup. Configuration lives in TOML files: corpus files define which source files are included and how functions are sampled, and model files specify the model identifier and parameters such as max_tokens and temperature. This keeps comparisons flexible and reproducible.
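The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the prompt wording, the helper names (build_prompt, first_n_lines, line_match_score), and the position-by-position scoring rule are all assumptions made for this example, and the function lookup only handles Python-style `def` lines.

```python
def build_prompt(corpus_text: str, function_name: str, n_lines: int) -> str:
    """Place the whole corpus in the prompt, then ask for a verbatim excerpt.

    The wording here is hypothetical; the real benchmark's prompt may differ.
    """
    return (
        "Below is a source corpus.\n\n"
        f"{corpus_text}\n\n"
        f"Reproduce verbatim the first {n_lines} lines of the function "
        f"`{function_name}`. Output only the code."
    )


def first_n_lines(corpus_text: str, function_name: str, n_lines: int) -> list:
    """Return the ground truth: N lines starting at `def <function_name>(`."""
    lines = corpus_text.splitlines()
    for i, line in enumerate(lines):
        if line.lstrip().startswith(f"def {function_name}("):
            return lines[i:i + n_lines]
    raise ValueError(f"function {function_name!r} not found in corpus")


def line_match_score(expected: list, response: str) -> float:
    """Fraction of expected lines reproduced exactly at the same position.

    Trailing whitespace is ignored; everything else must match character
    for character (an assumed scoring rule for this sketch).
    """
    got = response.splitlines()
    hits = sum(
        1
        for i, want in enumerate(expected)
        if i < len(got) and got[i].rstrip() == want.rstrip()
    )
    return hits / len(expected) if expected else 0.0
```

A run would call build_prompt once per sampled function, send the prompt to the model under test, and pass the reply to line_match_score together with the output of first_n_lines.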

Quick Start & Requirements

Installation: Utilizes uv for Python environment management (uv venv, uv pip install -r requirements.txt). Scripts are run via uv run.
Running: Execute benchmarks with uv run python bench.py run --corpus <corpus_name> --model <model_name>.
Docker: An optional Docker setup is available via docker compose run --rm app for an interactive environment.
Prerequisites: Requires a Python environment. Models must be loaded with a large context window for inference (e.g., in LM Studio the context has to be forced to 128K). The README includes detailed notes on context handling and KV caching for specific servers (llama.cpp, LM Studio, Ollama).
Links: benchmark_plan.md, configs/CONFIG_README.md, analysis/VIZ_README.md.

Highlighted Details

Positional Recall Benchmark: Quantifies verbatim code reproduction from long contexts.
Configurable Setup: TOML files allow flexible definition of test corpora and LLM parameters.
Multi-file Handling: Supports concatenating multiple files with explicit path markers and deduplicating name collisions.
Visualization: Generates Plotly HTML dashboards for analyzing benchmark results.
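To make the multi-file handling concrete, here is one way concatenation with path markers and collision handling could look. It is a sketch under stated assumptions: the "### FILE:" marker format, the helper names, and the choice to simply exclude function names defined in more than one file are all hypothetical, and the README summary does not say how the project actually resolves collisions.

```python
import re
from collections import Counter

# Matches Python-style function definitions; other languages would need
# their own pattern (assumption for this sketch).
DEF_RE = re.compile(r"^\s*def\s+([A-Za-z_]\w*)\s*\(", re.MULTILINE)


def concat_with_markers(files: dict) -> str:
    """Join files into one corpus, each preceded by an explicit path marker."""
    parts = []
    for path in sorted(files):  # sorted for a reproducible corpus layout
        parts.append(f"### FILE: {path}\n{files[path].rstrip()}\n")
    return "\n".join(parts)


def unambiguous_functions(files: dict) -> list:
    """Return (name, path) pairs for functions defined exactly once overall.

    Names that appear in several files (or twice in one file) are dropped,
    so a request for "the function foo" has a single correct answer.
    """
    counts = Counter()
    owner = {}
    for path, text in files.items():
        for name in DEF_RE.findall(text):
            counts[name] += 1
            owner[name] = path
    return sorted((name, owner[name]) for name in counts if counts[name] == 1)
```

With two files that both define run(), only the uniquely named functions remain as candidate targets, while the concatenated corpus still contains every file in full.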

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or roadmap are provided in the README.

Licensing & Compatibility

The README does not explicitly state the repository's license, so licensing should be confirmed before adopting the project.

Limitations & Caveats

LM Studio users may encounter issues with reported context sizes and automatic context unloading. Reasoning-capable models often ignore API toggles meant to disable chain-of-thought processing, necessitating larger max_tokens budgets. The benchmark focuses strictly on verbatim recall and positional accuracy, not broader LLM reasoning or generation capabilities.
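The max_tokens caveat can be illustrated with a small post-processing sketch. It assumes the model wraps its reasoning in `<think>...</think>` tags, as some open reasoning models do when served locally; the helper name strip_reasoning and the handling of an unclosed tag are assumptions for this example, not something the README describes.

```python
import re

# Non-greedy match so several separate reasoning blocks are each removed.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(response: str) -> str:
    """Remove reasoning blocks so only the reproduced code is scored."""
    cleaned = THINK_BLOCK.sub("", response)
    # An unclosed <think> means the token budget ran out mid-reasoning:
    # nothing after it is an answer, so discard it.
    open_idx = cleaned.find("<think>")
    if open_idx != -1:
        cleaned = cleaned[:open_idx]
    return cleaned.strip()
```

If the budget is exhausted before the closing tag, the cleaned response is empty and the recall score drops to zero regardless of what the model would have answered, which is why a larger max_tokens value is needed for such models.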
