RLHF-Reward-Modeling by RLHFlow

Recipes to train reward models for RLHF

created 1 year ago
1,418 stars

Top 29.3% on sourcepulse

Project Summary

This repository provides a comprehensive suite of recipes and code for training reward models (RMs) essential for Reinforcement Learning from Human Feedback (RLHF). It caters to researchers and practitioners in LLM alignment, offering implementations of various RM techniques, including Bradley-Terry, pairwise preference, semi-supervised, multi-objective (ArmoRM), and process/outcome-supervised methods. The project aims to facilitate reproducible and state-of-the-art reward modeling for RLHF pipelines.

How It Works

The project implements diverse reward modeling strategies: the classic Bradley-Terry model, pairwise preference models that directly predict preference probabilities, and generative RMs that leverage next-token prediction. It also incorporates advanced techniques such as Semi-Supervised Reward Modeling (SSRM) for data augmentation, ArmoRM for multi-objective rewards with context-dependent aggregation, and math-rm for process/outcome-supervised rewards, along with decision-tree RMs for interpretable preference modeling.
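As a concrete illustration of the first of these strategies, a Bradley-Terry RM attaches a scalar head to a language model and is trained so that the chosen response scores higher than the rejected one. The following is a minimal sketch, not the repository's training script; the base model name and helper function are placeholders:

```python
# Minimal Bradley-Terry reward-modeling sketch (illustrative only, not the repo's code).
# A scalar-head model scores chosen vs. rejected responses; the training loss is
# -log sigmoid(r_chosen - r_rejected), the negative log-likelihood of the observed preference.
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; any causal LM base works
tokenizer = AutoTokenizer.from_pretrained(BASE)
if tokenizer.pad_token is None:               # many chat models ship without a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

def bradley_terry_loss(chosen_texts, rejected_texts):
    """Negative log-likelihood of preferring `chosen` over `rejected` (Bradley-Terry).

    Both arguments are lists of full prompt+response strings for the same prompts.
    """
    chosen = tokenizer(chosen_texts, return_tensors="pt", padding=True, truncation=True)
    rejected = tokenizer(rejected_texts, return_tensors="pt", padding=True, truncation=True)
    r_chosen = model(**chosen).logits.squeeze(-1)      # one scalar reward per chosen sequence
    r_rejected = model(**rejected).logits.squeeze(-1)  # one scalar reward per rejected sequence
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The pairwise-preference and generative variants differ in that they condition on both responses at once and predict the preference probability (or a judgment token) directly, rather than scoring each response independently.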

Quick Start & Requirements

  • Installation: Separate environments are recommended for the different model families; setup instructions are provided in the respective model folders (e.g., bradley-terry-rm, pair-pm).
  • Prerequisites: Python with standard ML libraries. The README cites hardware such as 4x A40 48G or 4x A100 80G for training larger models with DeepSpeed ZeRO-3 and gradient checkpointing.
  • Data Format: Preference data consists of 'chosen' and 'rejected' conversations that share the same prompt (see the sketch after this list). Preprocessed datasets are available on Hugging Face.
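A single preference pair in that format might look like the following (a minimal sketch assuming the common role/content message convention; field names and structure should be verified against the preprocessed Hugging Face datasets):

```python
# Hypothetical preference pair: 'chosen' and 'rejected' are full conversations
# that share the same prompt and differ only in the assistant's reply.
example = {
    "chosen": [
        {"role": "user", "content": "Explain what a reward model does in RLHF."},
        {"role": "assistant", "content": "A reward model scores candidate responses so that ..."},
    ],
    "rejected": [
        {"role": "user", "content": "Explain what a reward model does in RLHF."},  # same prompt
        {"role": "assistant", "content": "I'm not sure."},
    ],
}
```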

Highlighted Details

  • Achieves state-of-the-art scores on RewardBench with models like ArmoRM-Llama3-8B-v0.1 (89.0) and Decision-Tree-Reward-Gemma-2-27B (95.4).
  • Provides code for multiple RM architectures: Bradley-Terry, Pairwise Preference, ArmoRM, SSRM, math-rm, and decision-tree RMs.
  • Includes open-sourced data, code, hyperparameters, and models for reproducibility.
  • Supports training for DRL-based RLHF (PPO), iterative SFT, and iterative DPO.

Maintenance & Community

  • Active development, with recent releases covering decision-tree RMs, process/outcome RMs (PRM/ORM), and ArmoRM.
  • Models and code have been contributed to numerous academic research projects.
  • Citation information and BibTeX entries are provided.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. However, the models released (e.g., ArmoRM-Llama3-8B-v0.1) are subject to the base model's license (e.g., Llama 3 license). Compatibility for commercial use depends on the specific model and base LLM licenses.

Limitations & Caveats

  • The README does not specify a repository-wide license, potentially impacting commercial use.
  • Some advanced RM techniques (e.g., LLM-as-a-judge, Inverse-Q*) are listed under "To Do" or not yet implemented within the provided code structure.
Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 103 stars in the last 90 days
