RLHF alternative for training socially aligned language models
This repository provides an alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning language models, focusing on efficiency and stability. It targets researchers and developers seeking to train socially aligned LLMs by leveraging simulated human society interactions. The core benefit is a potentially more robust and less gameable alignment process.
How It Works
The project bypasses traditional reward modeling by training directly on interaction data generated within a simulated social environment ("Sandbox"). In this multi-agent simulation, language models act as social agents that interact with one another and generate data; the recorded interactions are then used for alignment training, aiming for higher-quality training signal and more stable optimization than RLHF.
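As a rough illustration of this data-generation loop (not the repository's actual implementation), the sketch below assumes hypothetical query_agent and rate_response callables that wrap the LLM-backed social agents:

```python
# Conceptual sketch of one Sandbox interaction round: every agent drafts a
# response and its peers rate the draft. `query_agent` and `rate_response`
# are hypothetical stand-ins for the LLM-backed social agents.
from typing import Callable, Dict, List


def simulate_round(
    question: str,
    agents: List[str],
    query_agent: Callable[[str, str], str],
    rate_response: Callable[[str, str, str], float],
) -> List[Dict]:
    records = []
    for author in agents:
        draft = query_agent(author, question)
        # Peers score how socially acceptable the draft is.
        ratings = [
            rate_response(peer, question, draft)
            for peer in agents
            if peer != author
        ]
        records.append(
            {
                "question": question,
                "response": draft,
                "mean_rating": sum(ratings) / len(ratings),
            }
        )
    # These records form the interaction data later used for alignment training.
    return records
```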
Quick Start & Requirements
- Install with pip install -r requirements.txt and pip install -e .
- OpenAI API access is required (credentials go in .env); text-davinci-002/003 and gpt-3.5-turbo or GPT-4 serve as the simulation agents.
- Training is launched via torchrun with FSDP; BF16 support is recommended.
- Interaction data ships as assets/sandbox_v1.json (93.8k samples) and assets/sandbox_v2.json (169k samples); the full dataset is available upon request. A loading sketch appears below.
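The released Sandbox files are plain JSON and can be inspected as sketched below; the record contents mentioned in the comment are an assumption and should be verified against the actual files.

```python
# Sketch: load and inspect the released Sandbox interaction data.
import json

with open("assets/sandbox_v1.json") as f:
    samples = json.load(f)

print(f"Loaded {len(samples)} interaction samples")
# Each record presumably bundles a prompt, candidate responses, and peer
# feedback from the simulation -- verify against the real schema.
print(samples[0])
```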
Highlighted Details
- The released models include better-base, hh-rlhf-sft, and socially-good-lm (see the loading sketch below).
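If the checkpoints are distributed in a Hugging Face-compatible format, they could be loaded roughly as follows; the model path is a placeholder, not a confirmed hub ID, so check the repository for the actual distribution format.

```python
# Sketch: load a released checkpoint for inference (assumes a Hugging
# Face-compatible checkpoint; the path below is a hypothetical placeholder).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/socially-good-lm"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

prompt = "How should I respond to a rude comment online?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```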
Maintenance & Community
The project accompanies the paper "Training Socially Aligned Language Models in Simulated Human Society" by Liu et al. (2023). The README details no community channels; the repository was last updated roughly two years ago and appears inactive.
Licensing & Compatibility
The repository does not explicitly state a license. The code and data are presented for research purposes, and commercial use would require clarification.
Limitations & Caveats
The project relies heavily on OpenAI's API for simulation agents, incurring costs and external dependencies. The "Stable Alignment" method's generalizability and robustness beyond the described simulations require further validation. The training process requires significant computational resources and specific FSDP configurations.