Reward model evaluation tool
Top 54.0% on sourcepulse
RewardBench is an evaluation tool for assessing the capabilities and safety of reward models (RMs) and models trained with Direct Preference Optimization (DPO). It provides a standardized framework for running inference, formatting datasets, and analyzing results, aimed at researchers and developers working on AI alignment and preference learning.
How It Works
RewardBench offers a unified interface for evaluating various RMs, including Starling, PairRM, OpenAssistant, and DPO models. It standardizes dataset formatting and inference procedures to ensure fair comparisons. The tool supports both direct RM evaluation and DPO model evaluation; when it detects an instruction-style dataset (one without preference pairs), it logs model outputs without computing accuracy metrics.
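To make the core comparison concrete, below is a minimal sketch (not RewardBench's internal code) of the kind of check a reward-model evaluation performs: score a "chosen" and a "rejected" completion with a Hugging Face sequence-classification RM and verify the chosen one scores higher. The model name, prompt, and responses are illustrative.

```python
# Illustrative sketch of scoring a preference pair with a reward model.
# Model name, prompt, and responses are placeholders, not RewardBench internals.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example RM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

prompt = "Explain why the sky is blue."
chosen = "Sunlight scatters off air molecules; shorter blue wavelengths scatter most."
rejected = "The sky is blue because the ocean reflects onto it."

def score(prompt: str, response: str) -> float:
    """Return the scalar reward the model assigns to (prompt, response)."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

# The per-pair check: did the chosen response outscore the rejected one?
print("chosen preferred:", score(prompt, chosen) > score(prompt, rejected))
```

Aggregated over a preference dataset, this chosen-beats-rejected check is the kind of accuracy metric the tool reports for RMs.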
Quick Start & Requirements
Install the package: pip install rewardbench
Run an evaluation: rewardbench --model={yourmodel} --dataset={yourdataset} --batch_size=8
For generative reward models, install the extra dependencies with pip install rewardbench[generative], then run rewardbench-gen --model={yourmodel}
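As a concrete example (the model name below is illustrative; only the flags shown above are assumed, and omitting --dataset is assumed to fall back to the default evaluation set):
rewardbench --model=OpenAssistant/reward-model-deberta-v3-large-v2 --batch_size=8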
Highlighted Details
Maintenance & Community
The project is primarily maintained by the Allen Institute for AI (AI2). Docker images are available for reproducible research. Contributions are welcomed via pull requests for inference stack enhancements.
Licensing & Compatibility
The repository is licensed under the Apache-2.0 license, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Support for loading local models via AutoModelForSequenceClassification.from_pretrained is marked as a TODO. Some functionality, such as direct metadata uploads for non-DPO models on preference datasets, may require opening an issue to request an enhancement.
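Until local-model loading lands, one possible workaround is to score a locally saved checkpoint directly with transformers. This is a sketch under the assumption that the checkpoint has a scalar classification head; the path is a placeholder, and this is not RewardBench functionality.

```python
# Workaround sketch: score a local reward-model checkpoint with transformers.
# The path is a placeholder; a scalar (single-logit) head is assumed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

local_path = "/path/to/your/reward-model"  # placeholder local checkpoint
tokenizer = AutoTokenizer.from_pretrained(local_path)
model = AutoModelForSequenceClassification.from_pretrained(local_path)
model.eval()

inputs = tokenizer("prompt text", "candidate response", return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).logits[0].item()  # assumes a single reward logit
print("reward:", reward)
```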