RULER: evaluation suite for long-context language models (research paper)
RULER is a framework for evaluating the true context window capabilities of long-context language models. It generates synthetic data across various task complexities to benchmark model performance beyond simple recall, aiding researchers and developers in understanding model limitations.
How It Works
RULER uses a synthetic data generation pipeline to create evaluation datasets, systematically varying sequence length and task complexity so that model degradation can be analyzed at a fine grain. It benchmarks models on tasks such as needle-in-a-haystack retrieval, multi-hop variable tracking, and common-word extraction, yielding quantitative estimates of a model's effective context length.
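To make the task format concrete, below is a minimal, hypothetical sketch of how a needle-in-a-haystack example can be synthesized at a target context length. The function name, key/value scheme, and phrasing are illustrative assumptions, not RULER's actual generator.

    import random

    def make_niah_example(filler_sentences, target_words, num_needles=1):
        # Hypothetical generator, loosely in the spirit of RULER's
        # synthetic pipeline; not the project's actual code.
        # 1. Repeat filler sentences until the target length is reached.
        haystack = []
        while sum(len(s.split()) for s in haystack) < target_words:
            haystack.append(random.choice(filler_sentences))
        # 2. Insert key-value "needles" at random depths in the haystack.
        needles = {f"key-{random.randrange(10**6)}": str(random.randrange(10**6))
                   for _ in range(num_needles)}
        for key, value in needles.items():
            pos = random.randrange(len(haystack) + 1)
            haystack.insert(pos, f"The magic number for {key} is {value}.")
        # 3. Ask for one needle's value; the answer is known by construction.
        query = random.choice(list(needles))
        prompt = " ".join(haystack) + f"\nWhat is the magic number for {query}?"
        return {"input": prompt, "answer": needles[query]}

    # Sweeping target_words (e.g. 4k to 128k) and num_needles yields the
    # length/complexity grid described above.
    example = make_niah_example(
        ["The grass is green.", "The sky is blue."], target_words=4096)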
Quick Start & Requirements
Pull the prebuilt Docker image with docker pull cphsieh/ruler:0.2.0, or build it yourself with docker build -f Dockerfile -t cphsieh/ruler:0.2.0 . (the image is based on nvcr.io/nvidia/pytorch:23.10-py3). Requirements: Python, Hugging Face model checkpoints, and optionally TensorRT-LLM. Download the source data for the synthetic tasks (download_paulgraham_essay.py, download_qa_dataset.sh), then configure run.sh and config_models.sh with model paths and types.
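A typical end-to-end invocation, assuming the script layout and argument order documented in the upstream repository (the model name llama2-7b-chat and benchmark name synthetic are placeholders; paths may differ between versions):

    # Fetch source texts for the synthetic tasks.
    cd scripts/data/synthetic/json
    python download_paulgraham_essay.py
    bash download_qa_dataset.sh

    # Launch evaluation for a model configured in config_models.sh.
    cd ../../..
    bash run.sh llama2-7b-chat synthetic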
Highlighted Details
Maintenance & Community
This is a research project from NVIDIA. Community contributions of new tasks are welcome.
Licensing & Compatibility
Limitations & Caveats
The current RULER tasks are designed for models that perform well at short contexts; more complex tasks where models struggle even at short lengths are not included. The evaluation is not exhaustive for all models and task configurations. The framework does not replace realistic task evaluations.