safe-rlhf by PKU-Alignment

Safe RLHF for constrained value alignment in LLMs

Created 2 years ago
1,528 stars

Top 27.2% on SourcePulse

Project Summary

This repository provides a modular framework for Reinforcement Learning from Human Feedback (RLHF), specifically focusing on "Safe RLHF" for constrained value alignment in Large Language Models (LLMs). It targets researchers and developers aiming to train LLMs that are both helpful and harmless, offering a comprehensive pipeline from Supervised Fine-Tuning (SFT) to RLHF and evaluation, along with a substantial human-labeled dataset.

How It Works

The core innovation is the integration of a "Cost Model" alongside the traditional "Reward Model" within the RLHF process. This allows for constrained optimization, aiming to maximize helpfulness (reward) while minimizing harmfulness (cost). The framework supports various pre-trained models like LLaMA and Baichuan, and utilizes DeepSpeed for efficient distributed training, including ZeRO-Offload for memory management.
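
A minimal sketch of the constrained objective behind this design, assuming a Lagrangian relaxation in which the RL step maximizes reward minus a learned multiplier times cost, and the multiplier is pushed up whenever the average cost is positive. The variable names, scaling, and update rule below are illustrative, not the repository's exact implementation.

```python
import torch

# Hypothetical scalar outputs of a trained Reward Model (helpfulness) and
# Cost Model (harmfulness; positive means unsafe) for a batch of responses.
rewards = torch.tensor([1.2, 0.4, -0.3])
costs = torch.tensor([-0.5, 0.8, 0.1])

# Lagrange multiplier lambda > 0 trades helpfulness against harmlessness;
# parameterized through its log so it stays positive.
log_lambda = torch.zeros(1, requires_grad=True)
lambda_lr = 0.05


def shaped_reward(r, c, lam):
    """Penalized reward handed to the RL (PPO) step: r - lambda * c."""
    return r - lam * c


# One illustrative multiplier update: gradient ascent on lambda * E[cost],
# so lambda grows when the batch is unsafe on average and shrinks otherwise.
lam = log_lambda.exp()
lambda_loss = -lam * costs.mean().detach()
lambda_loss.backward()
with torch.no_grad():
    log_lambda -= lambda_lr * log_lambda.grad
    log_lambda.grad.zero_()

print(shaped_reward(rewards, costs, log_lambda.exp().detach()))
```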

Quick Start & Requirements

  • Install: Clone the repository and set up a conda environment (conda env create --file conda-recipe.yaml) or use Docker (make docker-run).
  • Prerequisites: Python, Conda/Mamba, DeepSpeed, NVIDIA Container Toolkit (for Docker). Training requires significant GPU resources (tested with 8 x NVIDIA A800-80GB GPUs for LLaMA-7B).
  • Resources: Training scripts include options for DeepSpeed ZeRO-Offload to manage GPU memory (a config sketch follows this list).
  • Data: PKU-SafeRLHF dataset on Hugging Face
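
As referenced in the Resources item, memory pressure is handled with DeepSpeed ZeRO-Offload. The snippet below is a hedged sketch of the kind of DeepSpeed configuration involved; the keys come from DeepSpeed's config schema, but the values are placeholders and may not match the options the repository's scripts actually pass.

```python
# Illustrative DeepSpeed config enabling ZeRO stage 3 with CPU offload of
# parameters and optimizer state; values are placeholders, not repo defaults.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# Typically passed to deepspeed.initialize(model=..., config=ds_config)
# or written to a JSON file referenced by the launch script.
```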

Highlighted Details

  • Supports SFT, RLHF, and Safe RLHF for models such as LLaMA, OPT, and Baichuan.
  • Offers a large human-labeled dataset (PKU-SafeRLHF-1M) with safety preferences across multiple harm categories; a loading sketch follows this list.
  • Includes pre-trained checkpoints for Reward and Cost Models.
  • Provides multi-scale safety verification metrics (BIG-bench, GPT-4 Evaluation).
  • First framework to incorporate safety preferences directly into the RLHF stage with theoretical guarantees.
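
The preference data highlighted above is published on the Hugging Face Hub. Below is a minimal loading sketch assuming the `PKU-Alignment/PKU-SafeRLHF` dataset identifier and column names such as `prompt`, `response_0`, `response_1`, `better_response_id`, and `safer_response_id`; consult the dataset card for the authoritative identifier and schema.

```python
from datasets import load_dataset

# Assumed dataset identifier; smaller subsets may also exist under the
# PKU-Alignment organization -- see the dataset card on the Hugging Face Hub.
ds = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")

example = ds[0]
# Assumed fields: a prompt, two candidate responses, and indices marking
# which response was judged more helpful and which was judged safer.
print(example["prompt"])
print(example["response_0"], example["response_1"])
print("better:", example["better_response_id"], "safer:", example["safer_response_id"])
```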

Maintenance & Community

  • The project is from the PKU-Alignment team at Peking University.
  • Acknowledges foundational work from LLaMA, Stanford Alpaca, DeepSpeed, and DeepSpeed-Chat.
  • Future plans include releasing larger datasets, training larger LLMs, and supporting memory-efficient training methods like LoRA.

Licensing & Compatibility

  • Released under Apache License 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

Training requires substantial compute, typically multiple high-end GPUs. While the framework supports various models, users may need to adapt the scripts for specific hardware configurations or model paths. The full PKU-SafeRLHF dataset is being released gradually.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 11 stars in the last 30 days
