RLHF research code and paper focusing on PPO and reward modeling
This repository provides code and models for training Large Language Models (LLMs) using Reinforcement Learning from Human Feedback (RLHF), specifically focusing on the Proximal Policy Optimization (PPO) algorithm. It aims to lower the barrier for researchers to implement stable RLHF training, offering insights into the PPO process and releasing custom reward models and datasets.
How It Works
The project implements the PPO-max algorithm, an enhancement to PPO designed for stable LLM training. It involves training a reward model (RM) to predict human preferences and then using this RM to fine-tune a policy model via PPO. The repository offers pre-trained reward models and policy models, along with code for both reward model training and PPO fine-tuning, facilitating a complete RLHF pipeline.
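To make the reward-modeling step concrete, a pairwise ranking loss of the Bradley-Terry form is commonly used: the reward model scores a preferred ("chosen") and a dispreferred ("rejected") response to the same prompt, and the loss pushes the chosen score above the rejected one. The sketch below is a minimal PyTorch version; the model wrapper, batch layout, and field names are illustrative assumptions, not the repository's exact implementation.

```python
# Minimal sketch of a pairwise reward-model loss (Bradley-Terry style).
# The RewardModel wrapper and batch layout are illustrative assumptions,
# not the repository's exact implementation.
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, backbone_name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # Scalar value head on top of the last hidden state.
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                       # (batch, seq, hidden)
        # Score each sequence by the value at its last non-padding token
        # (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1  # (batch,)
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)  # (batch,)

def pairwise_loss(model, chosen, rejected):
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    r_chosen = model(**chosen)      # dict with input_ids, attention_mask
    r_rejected = model(**rejected)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
```

During PPO fine-tuning, the trained reward model then scores sampled responses, and those scores (typically combined with a KL penalty against the SFT policy) drive the policy update.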
Quick Start & Requirements
Key dependencies include transformers, accelerate, deepspeed, triton==1.0.0, and others. CUDA 11.7 is specified for the PyTorch installation.
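For orientation, the snippet below shows how a recovered policy checkpoint might be loaded with transformers once the diff weights have been merged (see Limitations & Caveats). The local checkpoint path is a hypothetical placeholder, not a path shipped by the repository.

```python
# Hedged example: loading a recovered (merged) policy checkpoint and sampling
# a response. The checkpoint path is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./recovered-llama-7b-policy"  # hypothetical local path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"  # needs accelerate
)

prompt = "Explain RLHF in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```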
Highlighted Details
Maintenance & Community
The project received the Best Paper Award at the NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following. Recent updates include the release of the reward model training code and the annotated hh-rlhf dataset.
Licensing & Compatibility
Limitations & Caveats
The Chinese SFT model has not been released, so users must supply their own SFT model or a strong base model. Recovering the released models requires merging diff weights with the base Llama-7B weights, which adds an extra setup step.
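To illustrate the recovery step, merging diff weights typically amounts to an element-wise addition of the released delta onto the corresponding base Llama-7B parameters. The sketch below assumes plain PyTorch state dicts and hypothetical file paths; the repository provides its own merge script, so this only shows the idea.

```python
# Sketch of diff-weight recovery: recovered = base + diff, parameter by parameter.
# File paths are hypothetical; the repository ships its own merge script.
import os
import torch

base_sd = torch.load("llama-7b-base/pytorch_model.bin", map_location="cpu")
diff_sd = torch.load("policy-model-diff/pytorch_model.bin", map_location="cpu")

merged_sd = {}
for name, base_param in base_sd.items():
    if name in diff_sd:
        merged_sd[name] = base_param + diff_sd[name]  # add the released delta
    else:
        merged_sd[name] = base_param                  # e.g. untouched buffers

os.makedirs("recovered-policy", exist_ok=True)
torch.save(merged_sd, "recovered-policy/pytorch_model.bin")
```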