Recipe for online iterative RLHF to align LLMs
This repository provides a comprehensive "recipe" for implementing online Reinforcement Learning from Human Feedback (RLHF) and iterative Direct Preference Optimization (DPO) for aligning Large Language Models (LLMs). It targets researchers and practitioners aiming to replicate state-of-the-art online RLHF workflows, offering a practical guide to achieve results comparable to or exceeding leading models like LLaMA-3-8B-Instruct using open-source components.
How It Works
The workflow follows a three-step process: Supervised Fine-Tuning (SFT) of a base LLM, training a Reward Model (RM) using pairwise preferences, and then iteratively refining the LLM using RLHF or DPO with generated data. The project emphasizes an "online" approach, suggesting continuous improvement through iterative feedback loops, which is claimed to outperform offline methods. It leverages existing frameworks like Axolotl for SFT and provides scripts for data generation, reward modeling, and the final RLHF/DPO training stages.
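One round of the online loop can be sketched in plain Python. Everything here is a stand-in (the reward model is a toy scoring function, the generation step returns canned strings, and the log-probabilities are fabricated numbers), but the control flow (sample K candidates, rank them with the reward model, keep the best/worst pair, apply a DPO-style loss) mirrors the three-step recipe described above:

```python
import math
import random

def toy_reward(prompt: str, response: str) -> float:
    """Stand-in for a trained reward model: longer, on-topic answers score higher."""
    on_topic = 1.0 if prompt.split()[0].lower() in response.lower() else 0.0
    return len(response.split()) + on_topic

def sample_candidates(prompt: str, k: int = 4) -> list[str]:
    """Stand-in for the generation step (the real recipe uses vllm)."""
    pool = [
        "Paris is the capital of France.",
        "The capital of France is Paris, a city on the Seine.",
        "I do not know.",
        "France's capital city is Paris.",
    ]
    return random.sample(pool, k)

def build_preference_pair(prompt: str, candidates: list[str], reward_fn) -> dict:
    """Rank candidates with the reward model; best becomes 'chosen', worst 'rejected'."""
    ranked = sorted(candidates, key=lambda r: reward_fn(prompt, r))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one pair, from summed per-response log-probs
    under the policy (pi_*) and the frozen reference model (ref_*)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

random.seed(0)
prompt = "capital of France?"
pair = build_preference_pair(prompt, sample_candidates(prompt), toy_reward)
print(pair["chosen"] != pair["rejected"])  # → True: a usable pair for the DPO step
print(round(dpo_loss(-10.0, -14.0, -11.0, -13.0), 4))  # → 0.5981
```

In the actual recipe, `sample_candidates` would be replaced by vllm generation and `toy_reward` by the trained reward model; each iteration's preference pairs then feed the DPO trainer before the next round of generation.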
Quick Start & Requirements
Key dependencies include torch==2.1.2, flash-attn==2.6.3, accelerate, deepspeed, transformers, vllm, wandb, and nvidia-ml-py3. Specific versions are recommended to avoid compatibility issues (e.g., numpy<2.0). Access to gated models requires authentication via huggingface-cli login.
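A minimal environment setup might look like the following. The version pins come from the requirements above; the environment name, Python version, and install ordering are assumptions, not taken from the original README:

```shell
# Create an isolated environment (conda assumed; venv works too).
conda create -n online-rlhf python=3.10 -y
conda activate online-rlhf

# Pinned versions from the requirements list.
pip install "torch==2.1.2" "numpy<2.0"
pip install "flash-attn==2.6.3"   # built against torch, so install torch first
pip install accelerate deepspeed transformers vllm wandb nvidia-ml-py3

# Authenticate so gated models (e.g., LLaMA-3 checkpoints) can be downloaded.
huggingface-cli login
```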
Setup may also require a patch to axolotl/src/axolotl/utils/bench.py.
Highlighted Details
A run_loop2.sh script automates the entire iterative training process.
Maintenance & Community
The project acknowledges contributions from Hugging Face teams (TRL, H4), Meta LLaMA, the Allen Institute for AI, evalplus, and Axolotl. Community links are not explicitly provided in the README.
Licensing & Compatibility
The repository itself does not specify a license. However, it heavily relies on and integrates with other projects, including Hugging Face libraries and models, which have their own licenses. Compatibility for commercial use would depend on the licenses of the underlying models and frameworks used.
Limitations & Caveats
Strict version requirements for numpy and torch are noted, indicating potential compatibility challenges.