RULER by NVIDIA

Evaluation suite for long-context language models (research paper)

created 1 year ago
1,213 stars

Top 33.0% on sourcepulse

Project Summary

RULER is a framework for evaluating the true context window capabilities of long-context language models. It generates synthetic data across various task complexities to benchmark model performance beyond simple recall, aiding researchers and developers in understanding model limitations.

How It Works

RULER employs a synthetic data generation pipeline to create evaluation datasets. It systematically varies sequence length and task complexity, allowing for granular analysis of model degradation. The framework benchmarks models against tasks like needle-in-a-haystack, variable tracking, and word extraction, providing quantitative metrics on effective context length.
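To make the pipeline concrete, here is a minimal Python sketch of the needle-in-a-haystack idea, written for illustration only: the function name, filler text, and needle template are ours, not RULER's actual API. It hides a key-value needle at a random depth in filler text and sweeps the sequence length, the dimension along which retrieval accuracy degrades.

    import random

    FILLER = "The grass is green. The sky is blue. The sun is yellow. "

    def make_niah_sample(num_words: int, key: str, value: str) -> dict:
        """Hide a key-value 'needle' at a random depth inside filler text."""
        words = (FILLER * (num_words // len(FILLER.split()) + 1)).split()[:num_words]
        needle = f"The special magic number for {key} is {value}."
        depth = random.randint(0, len(words))  # insertion depth also varies
        context = " ".join(words[:depth] + [needle] + words[depth:])
        question = f"What is the special magic number for {key}?"
        return {"input": context + "\n" + question, "answer": value}

    # Sweep sequence length to locate where a model's recall starts to fail.
    for n in (4_000, 8_000, 16_000, 32_000):
        sample = make_niah_sample(n, "apples", str(random.randint(100000, 999999)))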

Quick Start & Requirements

  • Docker: pull the prebuilt image (docker pull cphsieh/ruler:0.2.0) or build it locally with docker build -f Dockerfile -t cphsieh/ruler:0.2.0 .
  • Prerequisites: NVIDIA PyTorch container (nvcr.io/nvidia/pytorch:23.10-py3), Python, Hugging Face models, optionally TensorRT-LLM.
  • Data: Download datasets via provided scripts (download_paulgraham_essay.py, download_qa_dataset.sh).
  • Setup: Requires configuring run.sh and config_models.sh with model paths and types (a combined walkthrough sketch follows this list).
  • Docs: https://github.com/NVIDIA/RULER
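Taken together, a first run might look like the following hypothetical Python driver, executed from the repo root inside the container. The script names come from the bullets above, but the arguments passed to run.sh are assumptions rather than the confirmed interface; consult the repo docs for the exact invocation.

    import subprocess

    def sh(cmd: list[str]) -> None:
        """Run a command, echoing it first and failing loudly on error."""
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Fetch the source data used to build the synthetic haystacks.
    sh(["python", "download_paulgraham_essay.py"])
    sh(["bash", "download_qa_dataset.sh"])

    # 2. run.sh and config_models.sh must already be edited by hand with
    #    your model path and type, per the setup bullet above.

    # 3. Launch the benchmark; "MODEL_NAME" and "synthetic" are illustrative
    #    placeholders, not the confirmed arguments run.sh expects.
    sh(["bash", "run.sh", "MODEL_NAME", "synthetic"])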

Highlighted Details

  • Benchmarks 17 open-source models across 13 tasks in 4 categories.
  • Demonstrates that most models degrade significantly beyond a 32K sequence length, despite much larger claimed context windows.
  • Provides a configurable testbed for creating custom evaluation tasks.
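Since the testbed is configuration-driven, adding a task amounts to describing its knobs. The dictionary below is a hypothetical illustration of the dimensions RULER varies (sequence length, needle count, distractors); it is not the framework's real schema, so see the repo for the actual task definitions.

    # Hypothetical custom-task entry; NOT RULER's actual config schema.
    CUSTOM_TASK = {
        "name": "niah_multikey_custom",
        "category": "retrieval",  # one of the benchmark's 4 task categories
        "seq_lengths": [4096, 8192, 16384, 32768, 65536, 131072],
        "num_needles": 4,         # complexity knob: how many keys to track
        "num_distractors": 8,     # harder variant: add decoy needles
        "metric": "exact_match",
    }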

Maintenance & Community

This is a research project from NVIDIA. Community contributions of new tasks are welcome.

Licensing & Compatibility

  • License: Apache 2.0 (as per typical NVIDIA research repos, though not explicitly stated in README).
  • Compatibility: Primarily for research purposes. Commercial use depends on underlying model licenses.

Limitations & Caveats

The current RULER tasks are designed for models that perform well at short contexts; more complex tasks where models struggle even at short lengths are not included. The evaluation is not exhaustive for all models and task configurations. The framework does not replace realistic task evaluations.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 1

Star History

144 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 1 more.

yarn by jquesnelle

Top 1.0% on sourcepulse, 2k stars

Context window extension method for LLMs (research paper, models)

created 2 years ago, updated 1 year ago