context-rot by chroma-core

LLM Context Rot Evaluation Toolkit

Created 9 months ago

251 stars

Top 99.9% on SourcePulse

2 Experts Love This Project

Shawn Wang

Editor of Latent Space

Lewis Tunstall

Research Engineer at Hugging Face

Summary

This repository is a toolkit for replicating the research on 'Context Rot,' the phenomenon in which Large Language Model (LLM) performance degrades significantly as input token length increases. It challenges the assumption that models process their context uniformly regardless of length, and gives researchers and engineers the means to reproduce the experiments and quantify how performance changes across input sizes.

How It Works

The toolkit organizes three experimental setups: NIAH Extension (needle-in-a-haystack retrieval with semantic versus lexical matches), LongMemEval (long-context memory), and Repeated Words (sequence replication). Each setup measures how model performance degrades as the input token count grows, including on tasks that require more than simple lexical recall.
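To make the needle-in-a-haystack idea concrete, here is a minimal sketch of how such a prompt could be assembled and how the lexical overlap between a question and a needle might be estimated. The function names (insert_needle, build_prompt, lexical_overlap) and the overlap measure are illustrative assumptions, not code or metrics taken from the repository.

```python
import re


def insert_needle(haystack_sentences, needle, depth):
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = int(depth * len(haystack_sentences))
    return haystack_sentences[:index] + [needle] + haystack_sentences[index:]


def build_prompt(haystack_sentences, needle, question, depth):
    """Join the haystack (with the needle inserted) and append the question."""
    context = " ".join(insert_needle(haystack_sentences, needle, depth))
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def lexical_overlap(question, needle):
    """Fraction of distinct question words that also appear in the needle.

    Hypothetical proxy: a high value suggests a lexical match, a low value
    suggests the model must rely on meaning rather than shared wording.
    """
    q_words = set(re.findall(r"[a-z0-9']+", question.lower()))
    n_words = set(re.findall(r"[a-z0-9']+", needle.lower()))
    if not q_words:
        return 0.0
    return len(q_words & n_words) / len(q_words)


if __name__ == "__main__":
    haystack = [f"Filler sentence {i}." for i in range(10)]
    question = "What is the secret code?"
    lexical_needle = "The secret code is 42."
    semantic_needle = "Use 42 to unlock the vault."
    print(lexical_overlap(question, lexical_needle))
    print(lexical_overlap(question, semantic_needle))
    print(build_prompt(haystack, semantic_needle, question, depth=0.5))
```

Sweeping the haystack length and the depth while holding the needle and question fixed is one way to observe how retrieval changes with input size under these assumptions.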

Quick Start & Requirements

To set up, clone the repository, create and activate a Python virtual environment, and install dependencies with pip install -r requirements.txt. API keys for OpenAI, Anthropic, and Google must be supplied as environment variables. Each experiment lives in its own folder, and users follow that folder's README to run it. The README does not state hardware, OS, or setup-time requirements. It links to the technical report (https://research.trychroma.com/context-rot) and to the datasets.
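Because the experiments depend on API keys being present in the environment, a small pre-flight check can save a failed run. The sketch below assumes the conventional variable names OPENAI_API_KEY, ANTHROPIC_API_KEY, and GOOGLE_API_KEY; this summary does not state the exact names the repository expects, so each experiment's README remains the authority.

```python
import os

# Assumed variable names (not confirmed by this summary); adjust to match
# whatever the experiment READMEs actually require.
ASSUMED_KEY_VARS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]


def missing_keys(env=None, names=ASSUMED_KEY_VARS):
    """Return the names from `names` that are unset or empty in `env`."""
    if env is None:
        env = os.environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All assumed API key variables are set.")
```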

Highlighted Details

NIAH Extension: Explores semantic vs. lexical "needle" matches and haystack variations in "Needle in a Haystack" tasks.
LongMemEval: A dedicated task for evaluating LLM performance on long-context memory recall.
Repeated Words: Tests model reliability in replicating sequences of repeated words as context length grows.
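As an illustration of the Repeated Words idea, the sketch below builds a sequence of one repeated word with a single distinct word inserted, then scores a model's attempted copy against it. The helper names and the use of difflib's similarity ratio are assumptions made for this example; this summary does not say which scoring metric the repository uses.

```python
from difflib import SequenceMatcher


def build_sequence(common_word, unique_word, total_words, unique_index):
    """Return `total_words` copies of `common_word` with one `unique_word` inserted."""
    if not 0 <= unique_index < total_words:
        raise ValueError("unique_index must fall inside the sequence")
    words = [common_word] * total_words
    words[unique_index] = unique_word
    return " ".join(words)


def replication_score(expected, actual):
    """Word-level similarity in [0, 1]; 1.0 means an exact copy.

    Stand-in metric chosen for this sketch. autojunk is disabled because
    difflib's default heuristic ignores very frequent elements in long
    sequences, which would distort scores for repeated-word inputs.
    """
    matcher = SequenceMatcher(None, expected.split(), actual.split(), autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    expected = build_sequence("apple", "apples", total_words=50, unique_index=20)
    truncated = " ".join(expected.split()[:30])  # simulates a model that stops early
    print(replication_score(expected, expected))
    print(replication_score(expected, truncated))
```

Increasing total_words while tracking the score is one simple way to see whether copying reliability falls as context grows. Note that difflib's matching is quadratic in the worst case, so very long sequences may call for a faster edit-distance implementation.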

Maintenance & Community

The provided README does not contain information regarding notable contributors, sponsorships, partnerships, community channels (like Discord or Slack), or a public roadmap.

Licensing & Compatibility

The repository's README does not specify a software license. Therefore, its terms for use, modification, and distribution, particularly for commercial purposes or integration into closed-source projects, are unclear.

Limitations & Caveats

The toolkit is primarily designed for replicating specific experimental results and may require adaptation for broader LLM evaluation. The core problem it investigates, "context rot," implies that current LLM architectures may struggle with maintaining consistent performance across extended input contexts, a limitation inherent to the models themselves rather than the toolkit.

Health Check

Last Commit

7 months ago

Responsiveness

Inactive

Star History

19 stars in the last 30 days