This repository provides a comprehensive suite of recipes and code for training reward models (RMs) essential for Reinforcement Learning from Human Feedback (RLHF). It caters to researchers and practitioners in LLM alignment, offering implementations of various RM techniques, including Bradley-Terry, pairwise preference, semi-supervised, multi-objective (ArmoRM), and process/outcome-supervised methods. The project aims to facilitate reproducible and state-of-the-art reward modeling for RLHF pipelines.
How It Works
The project implements diverse reward modeling strategies, including the classic Bradley-Terry model, pairwise preference models that directly predict preference probabilities, and generative RMs that leverage next-token prediction. It also incorporates advanced techniques such as Semi-Supervised Reward Modeling (SSRM) for data augmentation, ArmoRM for multi-objective rewards with context-dependent aggregation, and the math-rm recipes for process- and outcome-supervised rewards (PRM/ORM). Decision-tree RMs are also included for interpretable preference modeling.
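As a concrete illustration of the classic objective, the sketch below implements the Bradley-Terry pairwise loss in PyTorch. This is a minimal, assumed implementation for orientation only; the function name and example tensors are illustrative and not taken from the repository's code.

```python
# Minimal Bradley-Terry reward-modeling loss (illustrative sketch, not the repo's exact code).
# Given scalar rewards for a chosen and a rejected response to the same prompt,
# the BT objective maximizes log sigma(r_chosen - r_rejected).
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of preferring `chosen` over `rejected`."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example: rewards produced by a shared scalar head over a batch of 4 preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 0.8, 2.1])
r_rejected = torch.tensor([0.4, 0.5, -0.2, 1.0])
print(bradley_terry_loss(r_chosen, r_rejected))  # scalar batch loss
```

Pairwise preference models replace the scalar head with a model that predicts the preference probability directly, but the training signal is the same chosen-vs-rejected comparison.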
Quick Start & Requirements
- Installation: Separate environments are recommended for different models; setup instructions are provided in the respective model folders (e.g., `bradley-terry-rm`, `pair-pm`).
- Prerequisites: Python and standard ML libraries. Training larger models calls for hardware such as 4x A40 48GB or 4x A100 80GB, together with configurations like DeepSpeed ZeRO-3 and gradient checkpointing.
- Data Format: Preference data consists of 'chosen' and 'rejected' conversations that share the same prompt; preprocessed datasets are available on Hugging Face (see the sketch after this list for the expected record layout).
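For orientation, here is a hedged sketch of a record in the pairwise 'chosen'/'rejected' format described above. The field layout is assumed from that convention, and the dataset id in the commented-out loading call is a placeholder; consult the dataset cards on Hugging Face for the exact schema.

```python
# Illustrative record in the expected pairwise-preference format (assumed layout).
example = {
    "chosen": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ],
    "rejected": [
        {"role": "user", "content": "What is the capital of France?"},  # same prompt
        {"role": "assistant", "content": "France's capital is Lyon."},   # dispreferred reply
    ],
}

# Loading a preprocessed preference dataset (dataset id below is a placeholder):
# from datasets import load_dataset
# ds = load_dataset("your-org/your-preference-dataset", split="train")
# print(ds[0]["chosen"], ds[0]["rejected"])
```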
Highlighted Details
- Achieves state-of-the-art RewardBench scores with models such as ArmoRM-Llama3-8B-v0.1 (89.0) and Decision-Tree-Reward-Gemma-2-27B (95.4).
- Provides code for multiple RM architectures: Bradley-Terry, Pairwise Preference, ArmoRM, SSRM, math-rm, and decision-tree RMs.
- Includes open-sourced data, code, hyperparameters, and models for reproducibility.
- The trained RMs support DRL-based RLHF (PPO), iterative SFT, and iterative DPO pipelines.
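To show how a trained scalar RM from such a pipeline is typically queried at inference time, the sketch below scores a single conversation with a sequence-classification head via Hugging Face transformers. The checkpoint id is a placeholder, and models with custom heads (e.g., ArmoRM or the decision-tree RMs) may ship their own usage snippets on their model cards, which should be preferred.

```python
# Generic scoring sketch for a scalar sequence-classification RM (checkpoint id is a placeholder).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "your-org/your-bradley-terry-rm"  # placeholder, not an actual released checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16)

conversation = [
    {"role": "user", "content": "Explain RLHF in one sentence."},
    {"role": "assistant", "content": "RLHF fine-tunes a model against a learned reward."},
]
input_ids = tokenizer.apply_chat_template(conversation, tokenize=True, return_tensors="pt")
with torch.no_grad():
    reward = model(input_ids).logits[0].item()  # higher = more preferred under this RM
print(reward)
```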
Maintenance & Community
- Active development with recent releases for decision-tree, PRM/ORM, and ArmoRM.
- Models and code from this repository have contributed to numerous academic research projects.
- Citation information and BibTeX entries are provided.
Licensing & Compatibility
- The README does not state an explicit license for the repository itself. Released models (e.g., ArmoRM-Llama3-8B-v0.1) are subject to their base model's license (e.g., the Llama 3 license), so suitability for commercial use depends on the specific model and base LLM licenses.
Limitations & Caveats
- The README does not specify a repository-wide license, potentially impacting commercial use.
- Some advanced RM techniques (e.g., LLM-as-a-judge, Inverse-Q*) are listed under "To Do" or not yet implemented within the provided code structure.