Safe RLHF for constrained value alignment in LLMs
Top 27.9% on sourcepulse
This repository provides a modular framework for Reinforcement Learning from Human Feedback (RLHF), specifically focusing on "Safe RLHF" for constrained value alignment in Large Language Models (LLMs). It targets researchers and developers aiming to train LLMs that are both helpful and harmless, offering a comprehensive pipeline from Supervised Fine-Tuning (SFT) to RLHF and evaluation, along with a substantial human-labeled dataset.
How It Works
The core innovation is the integration of a "Cost Model" alongside the traditional "Reward Model" within the RLHF process. This allows for constrained optimization, aiming to maximize helpfulness (reward) while minimizing harmfulness (cost). The framework supports various pre-trained models like LLaMA and Baichuan, and utilizes DeepSpeed for efficient distributed training, including ZeRO-Offload for memory management.
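To make the constrained objective concrete, the sketch below shows the Lagrangian relaxation commonly used for this kind of reward-versus-cost trade-off: the policy maximizes reward minus a multiplier-weighted cost, while the multiplier grows whenever the expected cost exceeds its limit. The tensor values, learning rate, and cost limit are illustrative assumptions, not the repository's actual PPO training loop.

    # Minimal sketch of the Lagrangian update behind constrained RLHF.
    # Illustrative only: real training couples this with PPO rollouts and DeepSpeed.
    import torch

    cost_limit = 0.0                                  # constraint: expected cost <= cost_limit
    log_lambda = torch.zeros((), requires_grad=True)  # log-parameterization keeps lambda >= 0
    lambda_opt = torch.optim.Adam([log_lambda], lr=1e-2)

    def policy_objective(rewards, costs):
        """Scalar the policy maximizes: reward minus lambda-weighted cost."""
        lam = log_lambda.exp().detach()               # lambda is held fixed during the policy step
        return (rewards - lam * costs).mean()

    def update_lambda(costs):
        """Gradient ascent on lambda: it grows while the cost constraint is violated."""
        lambda_loss = -log_lambda.exp() * (costs.mean() - cost_limit)
        lambda_opt.zero_grad()
        lambda_loss.backward()
        lambda_opt.step()

    # Toy batch of reward-model and cost-model scores for sampled responses.
    rewards = torch.tensor([1.2, 0.8, 1.5])
    costs = torch.tensor([0.3, -0.1, 0.6])
    print(policy_objective(rewards, costs))           # would feed the PPO policy loss in practice
    update_lambda(costs)

In the full pipeline, the rewards and costs come from separately trained Reward and Cost Models scoring the policy's sampled responses.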
Quick Start & Requirements
Set up the environment with conda (conda env create --file conda-recipe.yaml) or use Docker (make docker-run).
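After setup, the released score models can be loaded to rate a prompt-response pair. The sketch below assumes the safe_rlhf.models.AutoModelForScore wrapper and the PKU-Alignment/beaver-7b-v1.0-reward checkpoint from the project's published examples; verify the exact names against the repository README.

    # Hedged sketch: class, attribute, and checkpoint names are assumptions
    # taken from the project's published examples and may have changed.
    import torch
    from transformers import AutoTokenizer
    from safe_rlhf.models import AutoModelForScore    # assumed wrapper provided by this repository

    model_id = 'PKU-Alignment/beaver-7b-v1.0-reward'  # swap in the cost model to score harmfulness
    model = AutoModelForScore.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map='auto')
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    text = 'BEGINNING OF CONVERSATION: USER: How do I stay safe online? ASSISTANT:Use strong, unique passwords.'
    inputs = tokenizer(text, return_tensors='pt').to(model.device)
    with torch.no_grad():
        output = model(**inputs)
    print(output.end_scores)                          # score for the final token (higher reward = more helpful)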
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Training has substantial computational requirements, typically multiple high-end GPUs. While the framework supports various models, users may need to adapt the provided scripts to their specific hardware configurations or model paths. The human-labeled dataset is being released in stages rather than all at once.