RULER by NVIDIA

Evaluation suite for long-context language models (with accompanying research paper)

Created 1 year ago
1,420 stars

Top 28.4% on SourcePulse

Project Summary

RULER is a framework for evaluating the true context window capabilities of long-context language models. It generates synthetic data across various task complexities to benchmark model performance beyond simple recall, aiding researchers and developers in understanding model limitations.

How It Works

RULER employs a synthetic data generation pipeline to create evaluation datasets. It systematically varies sequence length and task complexity, allowing for granular analysis of model degradation. The framework benchmarks models against tasks like needle-in-a-haystack, variable tracking, and word extraction, providing quantitative metrics on effective context length.
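As a concrete illustration of the approach, here is a minimal sketch of how a needle-in-a-haystack example can be generated and scored. This is not RULER's actual generator; all names and the filler text are illustrative, and the model call is left as a placeholder.

    import random
    import string

    FILLER = "The grass is green and the sky is blue."  # haystack filler sentence

    def make_niah_example(target_words: int, rng: random.Random) -> dict:
        """Bury a random key/value "needle" at a random position inside
        roughly target_words of filler; the model must retrieve the value."""
        key = "".join(rng.choices(string.ascii_lowercase, k=8))
        value = str(rng.randint(100_000, 999_999))
        needle = f"The magic number for {key} is {value}."

        sentences = [FILLER] * max(1, target_words // len(FILLER.split()))
        sentences.insert(rng.randrange(len(sentences) + 1), needle)
        prompt = " ".join(sentences) + f"\nWhat is the magic number for {key}?"
        return {"input": prompt, "answer": value}

    def score(prediction: str, answer: str) -> float:
        """Exact-substring recall: 1.0 iff the answer appears verbatim."""
        return 1.0 if answer in prediction else 0.0

    # Sweeping sequence length shows where retrieval starts to degrade.
    rng = random.Random(0)
    for n_words in (4_000, 32_000, 128_000):
        example = make_niah_example(n_words, rng)
        # prediction = model.generate(example["input"])  # model call omitted

Varying the needle's position and the amount of surrounding distractor text is how such a pipeline dials task complexity up or down independently of raw sequence length.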

Quick Start & Requirements

  • Docker: docker pull cphsieh/ruler:0.2.0, or build the image locally with docker build -f Dockerfile -t cphsieh/ruler:0.2.0 .
  • Prerequisites: NVIDIA PyTorch container (nvcr.io/nvidia/pytorch:23.10-py3), Python, Hugging Face models, optionally TensorRT-LLM.
  • Data: Download datasets via provided scripts (download_paulgraham_essay.py, download_qa_dataset.sh).
  • Setup: Requires configuring run.sh and config_models.sh with model paths and types.
  • Docs: https://github.com/NVIDIA/RULER

Highlighted Details

  • Benchmarks 17 open-source models across 13 tasks in 4 categories.
  • Demonstrates significant performance degradation in most models beyond 32K sequence length, even in models whose claimed context windows are far larger.
  • Provides a configurable testbed for creating custom evaluation tasks (see the sketch below).
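To give a flavor of that configurability, a custom task might be described by a small spec along the following lines. This is a hypothetical sketch: the field names are illustrative and do not reproduce RULER's actual task-definition schema.

    # Hypothetical task spec; field names are illustrative, not RULER's schema.
    custom_task = {
        "name": "niah_multikey",            # task identifier
        "haystack": "paulgraham_essays",    # source of filler text
        "num_needles": 4,                   # key/value pairs hidden in the haystack
        "num_queries": 1,                   # how many values the model must return
        "sequence_lengths": [4096, 8192, 16384, 32768, 65536, 131072],
        "samples_per_length": 500,          # examples generated at each length
        "metric": "exact_match",
    }

Scaling num_needles and num_queries independently of sequence_lengths is what lets a testbed of this kind separate retrieval difficulty from raw context size.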

Maintenance & Community

This is a research project from NVIDIA. Community contributions of new tasks are welcome.

Licensing & Compatibility

  • License: Apache 2.0 (as per typical NVIDIA research repos, though not explicitly stated in README).
  • Compatibility: Primarily for research purposes. Commercial use depends on underlying model licenses.

Limitations & Caveats

The current RULER tasks assume models that already perform well at short contexts; harder tasks on which models struggle even at short lengths are not included. The evaluation is not exhaustive across all models and task configurations, and synthetic benchmarks do not replace evaluation on realistic tasks.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 35 stars in the last 30 days

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 17 more.

Explore Similar Projects

simple-evals by openai

  • Lightweight library for evaluating language models
  • Top 0.7% on SourcePulse; 4k stars
  • Created 1 year ago; updated 5 months ago