reward-bench by allenai

Reward model evaluation tool

Created 1 year ago · 620 stars · Top 54.0% on sourcepulse

Project Summary

RewardBench is an evaluation tool for assessing the capabilities and safety of reward models (RMs) and models trained with Direct Preference Optimization (DPO). It provides a standardized framework for running inference, formatting datasets, and analyzing results, benefiting researchers and developers working on AI alignment and preference learning.

How It Works

RewardBench offers a unified interface for evaluating a range of RMs, including Starling, PairRM, OpenAssistant, and DPO-trained models. It standardizes dataset formatting and inference procedures so that models can be compared fairly. The tool supports both direct RM evaluation and DPO model evaluation; when it detects an instruction (non-preference) dataset, it logs model outputs without computing accuracy metrics.
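
To make the evaluation concrete, here is a minimal sketch of pairwise RM scoring with Hugging Face transformers: a preference pair counts as correct when the chosen response outscores the rejected one, and accuracy is the fraction of pairs that do. The model name and prompts are illustrative, and this is not RewardBench's internal API.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Illustrative sequence-classification RM that scores (question, answer) pairs.
    MODEL = "OpenAssistant/reward-model-deberta-v3-large-v2"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL)
    model.eval()

    def score(prompt: str, response: str) -> float:
        inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
        with torch.no_grad():
            return model(**inputs).logits[0].item()

    prompt = "Explain what a reward model is."
    chosen = "A reward model assigns scalar scores so better answers rank higher."
    rejected = "No idea."
    # The pair is counted correct if the chosen response outscores the rejected one.
    print(score(prompt, chosen) > score(prompt, rejected))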

Quick Start & Requirements

  • Install: pip install rewardbench
  • Run: rewardbench --model={yourmodel} --dataset={yourdataset} --batch_size=8
  • Generative RMs: pip install rewardbench[generative] then rewardbench-gen --model={yourmodel}
  • Dependencies: vLLM is required for local generative models; API access (OpenAI, Anthropic, or Together) is required for API-based generative models (see the judge sketch after this list).
  • Docs: RewardBench Dataset, Existing Test Sets, Results, Paper
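
As a hedged illustration of the LLM-as-a-judge setup behind rewardbench-gen, the sketch below asks an API model to pick the better of two answers. The judge model, prompt wording, and answer parsing are assumptions for illustration, not RewardBench's actual template.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def judge(question: str, answer_a: str, answer_b: str) -> str:
        # Illustrative judge prompt, not RewardBench's template.
        msg = (
            "Compare the two answers to the question and reply with only 'A' or 'B'.\n"
            f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative judge model
            messages=[{"role": "user", "content": msg}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip()

    print(judge("What is 2 + 2?", "4", "5"))  # expected: 'A'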

Highlighted Details

  • Supports local and API-based generative RMs (LLM-as-a-judge).
  • Includes functionality for "Best of N" rankings and offline RM ensembling (sketched after this list).
  • Offers advanced logging and can upload results to the Hugging Face Hub.
  • Provides scripts for running evaluations and submitting jobs via AI2's Beaker platform.
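
A Best-of-N ranking reduces to scoring each of N candidate completions with an RM and keeping the argmax; offline ensembling averages scores from several RMs before ranking. The helpers below reuse the illustrative score function from the earlier sketch and are assumptions, not RewardBench's API.

    def best_of_n(prompt: str, candidates: list[str]) -> str:
        # Keep the candidate the reward model scores highest.
        return max(candidates, key=lambda c: score(prompt, c))

    def ensemble_score(prompt: str, response: str, scorers) -> float:
        # Offline ensembling: average scalar rewards from several RMs.
        return sum(s(prompt, response) for s in scorers) / len(scorers)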

Maintenance & Community

The project is primarily maintained by the Allen Institute for AI (AI2). Docker images are available for reproducible research. Contributions are welcome via pull requests, particularly enhancements to the inference stack.

Licensing & Compatibility

The repository is licensed under the Apache-2.0 license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Support for loading local models via AutoModelForSequenceClassification.from_pretrained is still marked as a TODO. Certain features, such as direct metadata uploads for non-DPO models on preference datasets, are not yet supported and may require opening an enhancement issue.
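
For reference, loading a local sequence-classification RM with transformers itself looks like the following; the checkpoint path is hypothetical, and this is the generic transformers call rather than a RewardBench feature.

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    local_path = "./my-reward-model"  # hypothetical local checkpoint directory
    tokenizer = AutoTokenizer.from_pretrained(local_path)
    model = AutoModelForSequenceClassification.from_pretrained(local_path)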

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 3
  • Star history: 63 stars in the last 90 days
