RULER by NVIDIA

Evaluation suite for long-context language models (with accompanying research paper)

Created 1 year ago
1,420 stars

Top 28.4% on SourcePulse

Project Summary

RULER is a framework for evaluating the true context window capabilities of long-context language models. It generates synthetic data across various task complexities to benchmark model performance beyond simple recall, aiding researchers and developers in understanding model limitations.

How It Works

RULER employs a synthetic data generation pipeline to create evaluation datasets. It systematically varies sequence length and task complexity, allowing for granular analysis of model degradation. The framework benchmarks models against tasks like needle-in-a-haystack, variable tracking, and word extraction, providing quantitative metrics on effective context length.
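As a concrete illustration of the approach, here is a minimal sketch of how a needle-in-a-haystack example can be generated and scored. This is not RULER's actual generator; all names and the filler text are illustrative, and the model call is left as a placeholder.

    import random
    import string

    FILLER = "The grass is green and the sky is blue."  # haystack filler sentence

    def make_niah_example(target_words: int, rng: random.Random) -> dict:
        """Bury a random key/value "needle" at a random position inside
        roughly target_words of filler; the model must retrieve the value."""
        key = "".join(rng.choices(string.ascii_lowercase, k=8))
        value = str(rng.randint(100_000, 999_999))
        needle = f"The magic number for {key} is {value}."

        sentences = [FILLER] * max(1, target_words // len(FILLER.split()))
        sentences.insert(rng.randrange(len(sentences) + 1), needle)
        prompt = " ".join(sentences) + f"\nWhat is the magic number for {key}?"
        return {"input": prompt, "answer": value}

    def score(prediction: str, answer: str) -> float:
        """Exact-substring recall: 1.0 iff the answer appears verbatim."""
        return 1.0 if answer in prediction else 0.0

    # Sweeping sequence length shows where retrieval starts to degrade.
    rng = random.Random(0)
    for n_words in (4_000, 32_000, 128_000):
        example = make_niah_example(n_words, rng)
        # prediction = model.generate(example["input"])  # model call omitted

Varying the needle's position and the amount of surrounding distractor text is how such a pipeline dials task complexity up or down independently of raw sequence length.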

Quick Start & Requirements

  • Docker: docker pull cphsieh/ruler:0.2.0, or build the image locally with docker build -f Dockerfile -t cphsieh/ruler:0.2.0 .
  • Prerequisites: NVIDIA PyTorch container (nvcr.io/nvidia/pytorch:23.10-py3), Python, Hugging Face models, optionally TensorRT-LLM.
  • Data: Download datasets via provided scripts (download_paulgraham_essay.py, download_qa_dataset.sh).
  • Setup: Requires configuring run.sh and config_models.sh with model paths and types.
  • Docs: https://github.com/NVIDIA/RULER

Highlighted Details

  • Benchmarks 17 open-source models across 13 tasks in 4 categories.
  • Demonstrates significant performance degradation in most models beyond 32K sequence length, even in models whose claimed context windows are far larger.
  • Provides a configurable testbed for creating custom evaluation tasks (see the sketch below).
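To give a flavor of that configurability, a custom task might be described by a small spec along the following lines. This is a hypothetical sketch: the field names are illustrative and do not reproduce RULER's actual task-definition schema.

    # Hypothetical task spec; field names are illustrative, not RULER's schema.
    custom_task = {
        "name": "niah_multikey",            # task identifier
        "haystack": "paulgraham_essays",    # source of filler text
        "num_needles": 4,                   # key/value pairs hidden in the haystack
        "num_queries": 1,                   # how many values the model must return
        "sequence_lengths": [4096, 8192, 16384, 32768, 65536, 131072],
        "samples_per_length": 500,          # examples generated at each length
        "metric": "exact_match",
    }

Scaling num_needles and num_queries independently of sequence_lengths is what lets a testbed of this kind separate retrieval difficulty from raw context size.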

Maintenance & Community

This is a research project from NVIDIA. Community contributions of new tasks are welcome.

Licensing & Compatibility

  • License: Apache 2.0 (as per typical NVIDIA research repos, though not explicitly stated in README).
  • Compatibility: Primarily for research purposes. Commercial use depends on underlying model licenses.

Limitations & Caveats

The current RULER tasks assume models that already perform well at short contexts; harder tasks on which models struggle even at short lengths are not included. The evaluation is not exhaustive across all models and task configurations, and synthetic benchmarks do not replace evaluation on realistic tasks.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 35 stars in the last 30 days

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 17 more.

Explore Similar Projects

simple-evals by openai

  • Lightweight library for evaluating language models
  • Top 0.7% on SourcePulse; 4k stars
  • Created 1 year ago; updated 5 months ago