This repository provides a comprehensive suite of recipes and code for training reward models (RMs) essential for Reinforcement Learning from Human Feedback (RLHF). It caters to researchers and practitioners in LLM alignment, offering implementations of various RM techniques, including Bradley-Terry, pairwise preference, semi-supervised, multi-objective (ArmoRM), and process/outcome-supervised methods. The project aims to facilitate reproducible and state-of-the-art reward modeling for RLHF pipelines.
How It Works
The project implements diverse reward modeling strategies, including the classic Bradley-Terry model, pairwise preference models that directly predict preference probabilities, and generative RMs that leverage next-token prediction. It also incorporates advanced techniques such as Semi-Supervised Reward Modeling (SSRM) for data augmentation, ArmoRM for multi-objective rewards with context-dependent aggregation, and the math-rm recipes for process- and outcome-supervised rewards (PRM/ORM). Decision-tree RMs are also included for interpretable preference modeling.
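As a concrete illustration of the classic objective, the sketch below implements the Bradley-Terry pairwise loss in PyTorch. This is a minimal, assumed implementation for orientation only; the function name and example tensors are illustrative and not taken from the repository's code.

```python
# Minimal Bradley-Terry reward-modeling loss (illustrative sketch, not the repo's exact code).
# Given scalar rewards for a chosen and a rejected response to the same prompt,
# the BT objective maximizes log sigma(r_chosen - r_rejected).
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of preferring `chosen` over `rejected`."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example: rewards produced by a shared scalar head over a batch of 4 preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 0.8, 2.1])
r_rejected = torch.tensor([0.4, 0.5, -0.2, 1.0])
print(bradley_terry_loss(r_chosen, r_rejected))  # scalar batch loss
```

Pairwise preference models replace the scalar head with a model that predicts the preference probability directly, but the training signal is the same chosen-vs-rejected comparison.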
Quick Start & Requirements
- Installation: Separate environments are recommended for different models; setup instructions are provided in the respective model folders (e.g., `bradley-terry-rm`, `pair-pm`).
- Prerequisites: Python and standard ML libraries. Training larger models calls for hardware such as 4x A40 48GB or 4x A100 80GB, together with configurations like DeepSpeed ZeRO-3 and gradient checkpointing.
- Data Format: Preference data consists of 'chosen' and 'rejected' conversations that share the same prompt; preprocessed datasets are available on Hugging Face (see the sketch after this list for the expected record layout).
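For orientation, here is a hedged sketch of a record in the pairwise 'chosen'/'rejected' format described above. The field layout is assumed from that convention, and the dataset id in the commented-out loading call is a placeholder; consult the dataset cards on Hugging Face for the exact schema.

```python
# Illustrative record in the expected pairwise-preference format (assumed layout).
example = {
    "chosen": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ],
    "rejected": [
        {"role": "user", "content": "What is the capital of France?"},  # same prompt
        {"role": "assistant", "content": "France's capital is Lyon."},   # dispreferred reply
    ],
}

# Loading a preprocessed preference dataset (dataset id below is a placeholder):
# from datasets import load_dataset
# ds = load_dataset("your-org/your-preference-dataset", split="train")
# print(ds[0]["chosen"], ds[0]["rejected"])
```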
Highlighted Details
- Achieves state-of-the-art RewardBench scores with models such as ArmoRM-Llama3-8B-v0.1 (89.0) and Decision-Tree-Reward-Gemma-2-27B (95.4).
- Provides code for multiple RM architectures: Bradley-Terry, Pairwise Preference, ArmoRM, SSRM, math-rm, and decision-tree RMs.
- Includes open-sourced data, code, hyperparameters, and models for reproducibility.
- The trained RMs support DRL-based RLHF (PPO), iterative SFT, and iterative DPO pipelines.
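To show how a trained scalar RM from such a pipeline is typically queried at inference time, the sketch below scores a single conversation with a sequence-classification head via Hugging Face transformers. The checkpoint id is a placeholder, and models with custom heads (e.g., ArmoRM or the decision-tree RMs) may ship their own usage snippets on their model cards, which should be preferred.

```python
# Generic scoring sketch for a scalar sequence-classification RM (checkpoint id is a placeholder).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "your-org/your-bradley-terry-rm"  # placeholder, not an actual released checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16)

conversation = [
    {"role": "user", "content": "Explain RLHF in one sentence."},
    {"role": "assistant", "content": "RLHF fine-tunes a model against a learned reward."},
]
input_ids = tokenizer.apply_chat_template(conversation, tokenize=True, return_tensors="pt")
with torch.no_grad():
    reward = model(input_ids).logits[0].item()  # higher = more preferred under this RM
print(reward)
```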
Maintenance & Community
- Active development with recent releases for decision-tree, PRM/ORM, and ArmoRM.
- Models and code from this repository have contributed to numerous academic research projects.
- Citation information and BibTeX entries are provided.
Licensing & Compatibility
- The README does not state an explicit license for the repository itself. Released models (e.g., ArmoRM-Llama3-8B-v0.1) are subject to their base model's license (e.g., the Llama 3 license), so suitability for commercial use depends on the specific model and base LLM licenses.
Limitations & Caveats
- The README does not specify a repository-wide license, potentially impacting commercial use.
- Some advanced RM techniques (e.g., LLM-as-a-judge, Inverse-Q*) are listed under "To Do" or not yet implemented within the provided code structure.