Reward model evaluation tool
Top 54.0% on sourcepulse
RewardBench is an evaluation tool for assessing the capabilities and safety of reward models (RMs) and models trained with Direct Preference Optimization (DPO). It provides a standardized framework for running inference, formatting datasets, and analyzing results, aimed at researchers and developers working on AI alignment and preference learning.
How It Works
RewardBench offers a unified interface for evaluating various RMs, including Starling, PairRM, OpenAssistant, and DPO models. It standardizes dataset formatting and inference procedures to ensure fair comparisons. The tool supports both direct RM evaluation and DPO model evaluation; when it detects an instruction-style dataset (one without preference pairs), it logs model outputs without computing accuracy metrics.
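To make the core comparison concrete, below is a minimal sketch (not RewardBench's internal code) of the kind of check a reward-model evaluation performs: score a "chosen" and a "rejected" completion with a Hugging Face sequence-classification RM and verify the chosen one scores higher. The model name, prompt, and responses are illustrative.

```python
# Illustrative sketch of scoring a preference pair with a reward model.
# Model name, prompt, and responses are placeholders, not RewardBench internals.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example RM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

prompt = "Explain why the sky is blue."
chosen = "Sunlight scatters off air molecules; shorter blue wavelengths scatter most."
rejected = "The sky is blue because the ocean reflects onto it."

def score(prompt: str, response: str) -> float:
    """Return the scalar reward the model assigns to (prompt, response)."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

# The per-pair check: did the chosen response outscore the rejected one?
print("chosen preferred:", score(prompt, chosen) > score(prompt, rejected))
```

Aggregated over a preference dataset, this chosen-beats-rejected check is the kind of accuracy metric the tool reports for RMs.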
Quick Start & Requirements
Install the package: pip install rewardbench
Run an evaluation: rewardbench --model={yourmodel} --dataset={yourdataset} --batch_size=8
For generative reward models, install the extra dependencies with pip install rewardbench[generative], then run rewardbench-gen --model={yourmodel}
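As a concrete example (the model name below is illustrative; only the flags shown above are assumed, and omitting --dataset is assumed to fall back to the default evaluation set):
rewardbench --model=OpenAssistant/reward-model-deberta-v3-large-v2 --batch_size=8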
Highlighted Details
Maintenance & Community
The project is primarily maintained by the Allen Institute for AI (AI2). Docker images are available for reproducible research. Contributions are welcomed via pull requests for inference stack enhancements.
Licensing & Compatibility
The repository is licensed under the Apache-2.0 license, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Support for loading local models via AutoModelForSequenceClassification.from_pretrained is marked as a TODO. Some functionality, such as direct metadata uploads for non-DPO models on preference datasets, may require opening an issue to request an enhancement.
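Until local-model loading lands, one possible workaround is to score a locally saved checkpoint directly with transformers. This is a sketch under the assumption that the checkpoint has a scalar classification head; the path is a placeholder, and this is not RewardBench functionality.

```python
# Workaround sketch: score a local reward-model checkpoint with transformers.
# The path is a placeholder; a scalar (single-logit) head is assumed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

local_path = "/path/to/your/reward-model"  # placeholder local checkpoint
tokenizer = AutoTokenizer.from_pretrained(local_path)
model = AutoModelForSequenceClassification.from_pretrained(local_path)
model.eval()

inputs = tokenizer("prompt text", "candidate response", return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).logits[0].item()  # assumes a single reward logit
print("reward:", reward)
```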