safe-rlhf by PKU-Alignment

Safe RLHF for constrained value alignment in LLMs

created 2 years ago
1,512 stars

Top 27.9% on sourcepulse

Project Summary

This repository provides a modular framework for Reinforcement Learning from Human Feedback (RLHF), specifically focusing on "Safe RLHF" for constrained value alignment in Large Language Models (LLMs). It targets researchers and developers aiming to train LLMs that are both helpful and harmless, offering a comprehensive pipeline from Supervised Fine-Tuning (SFT) to RLHF and evaluation, along with a substantial human-labeled dataset.

How It Works

The core innovation is the integration of a "Cost Model" alongside the traditional "Reward Model" within the RLHF process. This enables constrained optimization: the policy is trained to maximize helpfulness (reward) while keeping harmfulness (cost) below a threshold, with the trade-off balanced by a Lagrange multiplier. The framework supports pre-trained models such as LLaMA and Baichuan, and uses DeepSpeed for efficient distributed training, including ZeRO-Offload for memory management.
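To make the constrained objective concrete, here is a minimal PyTorch-style sketch of how a reward signal, a cost signal, and a learnable Lagrange multiplier can be combined. The function and variable names are hypothetical illustrations, not the safe-rlhf API, and the real training loop is a full PPO-Lagrangian update rather than this single objective.

    # Minimal sketch of a Lagrangian-style constrained RLHF objective.
    # Names are illustrative only, not the safe-rlhf API.
    import torch

    def constrained_objective(reward, cost, log_lambda, cost_limit=0.0):
        """reward/cost: per-response scores from the Reward/Cost Models."""
        lam = log_lambda.exp()  # Lagrange multiplier, kept positive via exp
        # Policy side: maximize reward, penalize cost weighted by the multiplier.
        actor_objective = (reward - lam.detach() * cost).mean()
        # Multiplier side: grow lambda while expected cost exceeds the limit,
        # shrink it once the safety constraint is satisfied.
        lambda_loss = -(lam * (cost.detach().mean() - cost_limit))
        return actor_objective, lambda_loss

    # Toy usage with random scores standing in for model outputs.
    reward, cost = torch.randn(8), torch.randn(8)
    log_lambda = torch.zeros(1, requires_grad=True)
    actor_obj, lam_loss = constrained_objective(reward, cost, log_lambda)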

Quick Start & Requirements

  • Install: Clone the repository and set up a conda environment (conda env create --file conda-recipe.yaml) or use Docker (make docker-run).
  • Prerequisites: Python, Conda/Mamba, DeepSpeed, NVIDIA Container Toolkit (for Docker). Training requires significant GPU resources (tested with 8 x NVIDIA A800-80GB GPUs for LLaMA-7B).
  • Resources: Training scripts include options for DeepSpeed ZeRO-Offload to manage GPU memory.
  • Docs: PKU-SafeRLHF dataset card on Hugging Face (a minimal loading sketch follows this list)
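The preference data can be pulled directly with the Hugging Face datasets library. A minimal sketch, assuming the hub ID PKU-Alignment/PKU-SafeRLHF (verify the exact dataset name and splits on the dataset card):

    # Load the PKU-SafeRLHF preference data from the Hugging Face Hub.
    # The dataset ID and split name are assumptions; check the dataset card.
    from datasets import load_dataset

    dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")
    print(dataset)      # row count and column names
    print(dataset[0])   # one record: a prompt plus paired, annotated responses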

Highlighted Details

  • Supports SFT, RLHF, and Safe RLHF for models such as LLaMA, OPT, and Baichuan.
  • Offers a large human-labeled dataset (PKU-SafeRLHF-1M) with safety preferences across multiple harm categories.
  • Includes pre-trained checkpoints for the Reward and Cost Models, both trained from preference comparisons (see the loss sketch after this list).
  • Offers multiple evaluation routes for safety and helpfulness, including BIG-bench and GPT-4-based evaluation.
  • Presented by the authors as the first framework to incorporate safety preferences directly into the RLHF stage with theoretical guarantees.
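Both value models are preference-trained. A generic Bradley-Terry-style ranking loss is sketched below; this is the standard RLHF recipe, not the exact safe-rlhf loss (per the paper, the Cost Model additionally uses per-response safe/unsafe labels to anchor the sign of the cost, which is omitted here).

    # Generic pairwise preference loss for a scalar scoring model.
    # A sketch of the standard recipe, not the exact safe-rlhf implementation.
    import torch
    import torch.nn.functional as F

    def preference_loss(score_preferred, score_rejected):
        """Bradley-Terry loss: the preferred response should score higher."""
        return -F.logsigmoid(score_preferred - score_rejected).mean()

    # Reward Model: "preferred" = the more helpful response.
    # Cost Model:   "preferred" = the more harmful response (higher cost).
    better, worse = torch.randn(4), torch.randn(4)
    loss = preference_loss(better, worse)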

Maintenance & Community

  • The project is from the PKU-Alignment team at Peking University.
  • Key contributions acknowledged from LLaMA, Stanford Alpaca, DeepSpeed, and DeepSpeed-Chat.
  • Future plans include releasing larger datasets, training larger LLMs, and supporting memory-efficient training methods like LoRA.

Licensing & Compatibility

  • Released under Apache License 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

Training is computationally demanding, requiring multiple high-end GPUs. While the framework supports various models, users may need to adapt scripts for specific hardware configurations or model paths. The full dataset is being released in stages.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 60 stars in the last 90 days
