Safe RLHF for constrained value alignment in LLMs
Top 27.9% on sourcepulse
This repository provides a modular framework for Reinforcement Learning from Human Feedback (RLHF), specifically focusing on "Safe RLHF" for constrained value alignment in Large Language Models (LLMs). It targets researchers and developers aiming to train LLMs that are both helpful and harmless, offering a comprehensive pipeline from Supervised Fine-Tuning (SFT) to RLHF and evaluation, along with a substantial human-labeled dataset.
How It Works
The core innovation is the integration of a "Cost Model" alongside the traditional "Reward Model" within the RLHF process. This allows for constrained optimization, aiming to maximize helpfulness (reward) while minimizing harmfulness (cost). The framework supports various pre-trained models like LLaMA and Baichuan, and utilizes DeepSpeed for efficient distributed training, including ZeRO-Offload for memory management.
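To make the constrained objective concrete, the sketch below shows the Lagrangian relaxation commonly used for this kind of reward-versus-cost trade-off: the policy maximizes reward minus a multiplier-weighted cost, while the multiplier grows whenever the expected cost exceeds its limit. The tensor values, learning rate, and cost limit are illustrative assumptions, not the repository's actual PPO training loop.

    # Minimal sketch of the Lagrangian update behind constrained RLHF.
    # Illustrative only: real training couples this with PPO rollouts and DeepSpeed.
    import torch

    cost_limit = 0.0                                  # constraint: expected cost <= cost_limit
    log_lambda = torch.zeros((), requires_grad=True)  # log-parameterization keeps lambda >= 0
    lambda_opt = torch.optim.Adam([log_lambda], lr=1e-2)

    def policy_objective(rewards, costs):
        """Scalar the policy maximizes: reward minus lambda-weighted cost."""
        lam = log_lambda.exp().detach()               # lambda is held fixed during the policy step
        return (rewards - lam * costs).mean()

    def update_lambda(costs):
        """Gradient ascent on lambda: it grows while the cost constraint is violated."""
        lambda_loss = -log_lambda.exp() * (costs.mean() - cost_limit)
        lambda_opt.zero_grad()
        lambda_loss.backward()
        lambda_opt.step()

    # Toy batch of reward-model and cost-model scores for sampled responses.
    rewards = torch.tensor([1.2, 0.8, 1.5])
    costs = torch.tensor([0.3, -0.1, 0.6])
    print(policy_objective(rewards, costs))           # would feed the PPO policy loss in practice
    update_lambda(costs)

In the full pipeline, the rewards and costs come from separately trained Reward and Cost Models scoring the policy's sampled responses.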
Quick Start & Requirements
Set up the environment with conda (conda env create --file conda-recipe.yaml) or use Docker (make docker-run).
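After setup, the released score models can be loaded to rate a prompt-response pair. The sketch below assumes the safe_rlhf.models.AutoModelForScore wrapper and the PKU-Alignment/beaver-7b-v1.0-reward checkpoint from the project's published examples; verify the exact names against the repository README.

    # Hedged sketch: class, attribute, and checkpoint names are assumptions
    # taken from the project's published examples and may have changed.
    import torch
    from transformers import AutoTokenizer
    from safe_rlhf.models import AutoModelForScore    # assumed wrapper provided by this repository

    model_id = 'PKU-Alignment/beaver-7b-v1.0-reward'  # swap in the cost model to score harmfulness
    model = AutoModelForScore.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map='auto')
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    text = 'BEGINNING OF CONVERSATION: USER: How do I stay safe online? ASSISTANT:Use strong, unique passwords.'
    inputs = tokenizer(text, return_tensors='pt').to(model.device)
    with torch.no_grad():
        output = model(**inputs)
    print(output.end_scores)                          # score for the final token (higher reward = more helpful)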
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Training has substantial computational requirements, typically multiple high-end GPUs. While the framework supports various models, users may need to adapt the provided scripts to their specific hardware configurations or model paths. The human-labeled dataset is being released in stages rather than all at once.