MOSS-RLHF by OpenLMLab

RLHF research code and models focusing on PPO and reward modeling

created 2 years ago
1,384 stars

Top 29.8% on sourcepulse

Project Summary

This repository provides code and models for training Large Language Models (LLMs) using Reinforcement Learning from Human Feedback (RLHF), specifically focusing on the Proximal Policy Optimization (PPO) algorithm. It aims to lower the barrier for researchers to implement stable RLHF training, offering insights into the PPO process and releasing custom reward models and datasets.

How It Works

The project implements the PPO-max algorithm, an enhancement to PPO designed for stable LLM training. It involves training a reward model (RM) to predict human preferences and then using this RM to fine-tune a policy model via PPO. The repository offers pre-trained reward models and policy models, along with code for both reward model training and PPO fine-tuning, facilitating a complete RLHF pipeline.
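For orientation, the sketch below shows the two core objectives that such a pipeline optimizes: a pairwise (Bradley-Terry style) reward-model loss and the clipped PPO surrogate. This is a generic PyTorch illustration, not the repository's PPO-max implementation, which layers additional stabilization tricks on top of these losses; all function and variable names here are illustrative.

```python
# Minimal sketch of the two standard RLHF objectives (illustrative only,
# not the repository's PPO-max code).
import torch
import torch.nn.functional as F


def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the reward of the human-preferred
    # response above that of the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


def ppo_clipped_loss(logprobs: torch.Tensor,
                     old_logprobs: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    # Standard clipped surrogate objective; PPO-max adds further tricks
    # (e.g. reward and advantage normalization) on top of this core loss.
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()


# Toy usage with per-sample scalars.
print(pairwise_reward_loss(torch.tensor([1.2, 0.7]), torch.tensor([0.3, 0.9])))
print(ppo_clipped_loss(torch.tensor([-1.0, -2.0]), torch.tensor([-1.1, -1.9]),
                       torch.tensor([0.5, -0.3])))
```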

Quick Start & Requirements

  • Installation: Requires Python 3.8 and PyTorch 1.13.1. Conda is recommended for environment setup.
  • Dependencies: Includes transformers, accelerate, deepspeed, triton==1.0.0, and others. CUDA 11.7 is specified for PyTorch installation.
  • Model Recovery: Requires downloading weight diffs and merging them with base Llama-7B models to recover the reward and policy models (see the sketch after this list).
  • Resources: Training requires significant computational resources typical for LLM fine-tuning.
  • Links: Technical report I, Technical report II, hh-rlhf-strength-cleaned dataset.
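The sketch below only illustrates the usual pattern behind "merging weight diffs": adding released diff tensors onto the base Llama-7B state dict. The file paths and the element-wise-add convention are assumptions here; follow the merge script shipped in the repository for the actual recovery procedure.

```python
# Hypothetical weight-diff merge (paths and add convention are assumptions).
import torch
from transformers import AutoModelForCausalLM

# Load the base Llama-7B weights and the released diff tensors.
base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b",
                                            torch_dtype=torch.float16)
diff_state = torch.load("path/to/reward_model_diff.bin", map_location="cpu")

# Recover the original parameters by adding each diff onto the base weight.
merged_state = base.state_dict()
for name, diff in diff_state.items():
    merged_state[name] = merged_state[name] + diff.to(merged_state[name].dtype)

base.load_state_dict(merged_state)
base.save_pretrained("path/to/recovered-model")
```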

Highlighted Details

  • Released competitive Chinese and English reward models with good cross-model generalization.
  • Proposed PPO-max algorithm for stable LLM training.
  • Released annotated hh-rlhf dataset with preference strength.
  • Offers pre-trained English SFT, reward, and policy models based on Llama-7B.
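Since the released checkpoints are Llama-7B based, a recovered SFT or policy model can in principle be loaded with standard Hugging Face transformers APIs, as in the hypothetical sketch below. The local path and the Human/Assistant prompt format are assumptions; the repository's own inference scripts may load and prompt these models differently.

```python
# Hypothetical generation with a recovered SFT/policy checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/recovered-sft-model")
model = AutoModelForCausalLM.from_pretrained("path/to/recovered-sft-model",
                                             device_map="auto")

# Prompt format is an assumption; check the repository for the expected template.
prompt = "Human: How do I keep RLHF training stable?\nAssistant:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```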

Maintenance & Community

The project received the Best Paper Award at the NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following. Recent updates include the release of the reward model training code and the annotated hh-rlhf dataset.

Licensing & Compatibility

  • Code License: Apache 2.0
  • Data License: CC BY-NC 4.0
  • Model License: GNU AGPL 3.0
  • Compatibility: The AGPL 3.0 license for models may impose restrictions on commercial use or linking with closed-source software.

Limitations & Caveats

The Chinese SFT model has not been released, so users must supply their own SFT model or another strong base model. Model recovery requires merging diff weights with the base Llama-7B weights, adding an extra setup step.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

30 stars in the last 90 days
