RLHF research code and paper focusing on PPO and reward modeling
This repository provides code and models for training Large Language Models (LLMs) using Reinforcement Learning from Human Feedback (RLHF), specifically focusing on the Proximal Policy Optimization (PPO) algorithm. It aims to lower the barrier for researchers to implement stable RLHF training, offering insights into the PPO process and releasing custom reward models and datasets.
How It Works
The project implements the PPO-max algorithm, an enhancement to PPO designed for stable LLM training. It involves training a reward model (RM) to predict human preferences and then using this RM to fine-tune a policy model via PPO. The repository offers pre-trained reward models and policy models, along with code for both reward model training and PPO fine-tuning, facilitating a complete RLHF pipeline.
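To make the reward-modeling step concrete, a pairwise ranking loss of the Bradley-Terry form is commonly used: the reward model scores a preferred ("chosen") and a dispreferred ("rejected") response to the same prompt, and the loss pushes the chosen score above the rejected one. The sketch below is a minimal PyTorch version; the model wrapper, batch layout, and field names are illustrative assumptions, not the repository's exact implementation.

```python
# Minimal sketch of a pairwise reward-model loss (Bradley-Terry style).
# The RewardModel wrapper and batch layout are illustrative assumptions,
# not the repository's exact implementation.
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, backbone_name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # Scalar value head on top of the last hidden state.
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                       # (batch, seq, hidden)
        # Score each sequence by the value at its last non-padding token
        # (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1  # (batch,)
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)  # (batch,)

def pairwise_loss(model, chosen, rejected):
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    r_chosen = model(**chosen)      # dict with input_ids, attention_mask
    r_rejected = model(**rejected)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
```

During PPO fine-tuning, the trained reward model then scores sampled responses, and those scores (typically combined with a KL penalty against the SFT policy) drive the policy update.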
Quick Start & Requirements
Key dependencies include transformers, accelerate, deepspeed, triton==1.0.0, and others. CUDA 11.7 is specified for the PyTorch installation.
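For orientation, the snippet below shows how a recovered policy checkpoint might be loaded with transformers once the diff weights have been merged (see Limitations & Caveats). The local checkpoint path is a hypothetical placeholder, not a path shipped by the repository.

```python
# Hedged example: loading a recovered (merged) policy checkpoint and sampling
# a response. The checkpoint path is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./recovered-llama-7b-policy"  # hypothetical local path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"  # needs accelerate
)

prompt = "Explain RLHF in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```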
Highlighted Details
Maintenance & Community
The project received the Best Paper Award at the NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following. Recent updates include the release of the reward model training code and the annotated hh-rlhf dataset.
Licensing & Compatibility
Limitations & Caveats
The Chinese SFT model has not been released, so users must supply their own SFT model or a strong base model. Recovering the released models requires merging diff weights with the base Llama-7B weights, which adds an extra setup step.
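To illustrate the recovery step, merging diff weights typically amounts to an element-wise addition of the released delta onto the corresponding base Llama-7B parameters. The sketch below assumes plain PyTorch state dicts and hypothetical file paths; the repository provides its own merge script, so this only shows the idea.

```python
# Sketch of diff-weight recovery: recovered = base + diff, parameter by parameter.
# File paths are hypothetical; the repository ships its own merge script.
import os
import torch

base_sd = torch.load("llama-7b-base/pytorch_model.bin", map_location="cpu")
diff_sd = torch.load("policy-model-diff/pytorch_model.bin", map_location="cpu")

merged_sd = {}
for name, base_param in base_sd.items():
    if name in diff_sd:
        merged_sd[name] = base_param + diff_sd[name]  # add the released delta
    else:
        merged_sd[name] = base_param                  # e.g. untouched buffers

os.makedirs("recovered-policy", exist_ok=True)
torch.save(merged_sd, "recovered-policy/pytorch_model.bin")
```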