RLHF alternative for training socially aligned language models
This repository provides an alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning language models, focusing on efficiency and stability. It targets researchers and developers seeking to train socially aligned LLMs by leveraging simulated human society interactions. The core benefit is a potentially more robust and less gameable alignment process.
How It Works
The project bypasses traditional reward modeling by training directly on interaction data generated within a simulated social environment ("Sandbox"). In this multi-agent simulation, language models act as social agents that interact with one another and generate data; the recorded interactions are then used for alignment training, aiming for higher-quality training signal and more stable optimization than RLHF.
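As a rough illustration of this data-generation loop (not the repository's actual implementation), the sketch below assumes hypothetical query_agent and rate_response callables that wrap the LLM-backed social agents:

```python
# Conceptual sketch of one Sandbox interaction round: every agent drafts a
# response and its peers rate the draft. `query_agent` and `rate_response`
# are hypothetical stand-ins for the LLM-backed social agents.
from typing import Callable, Dict, List


def simulate_round(
    question: str,
    agents: List[str],
    query_agent: Callable[[str, str], str],
    rate_response: Callable[[str, str, str], float],
) -> List[Dict]:
    records = []
    for author in agents:
        draft = query_agent(author, question)
        # Peers score how socially acceptable the draft is.
        ratings = [
            rate_response(peer, question, draft)
            for peer in agents
            if peer != author
        ]
        records.append(
            {
                "question": question,
                "response": draft,
                "mean_rating": sum(ratings) / len(ratings),
            }
        )
    # These records form the interaction data later used for alignment training.
    return records
```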
Quick Start & Requirements
- Install with pip install -r requirements.txt and pip install -e .
- OpenAI API access is required (credentials go in .env); text-davinci-002/003 and gpt-3.5-turbo or GPT-4 serve as the simulation agents.
- Training is launched via torchrun with FSDP; BF16 support is recommended.
- Interaction data ships as assets/sandbox_v1.json (93.8k samples) and assets/sandbox_v2.json (169k samples); the full dataset is available upon request. A loading sketch appears below.
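The released Sandbox files are plain JSON and can be inspected as sketched below; the record contents mentioned in the comment are an assumption and should be verified against the actual files.

```python
# Sketch: load and inspect the released Sandbox interaction data.
import json

with open("assets/sandbox_v1.json") as f:
    samples = json.load(f)

print(f"Loaded {len(samples)} interaction samples")
# Each record presumably bundles a prompt, candidate responses, and peer
# feedback from the simulation -- verify against the real schema.
print(samples[0])
```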
Highlighted Details
- The released models include better-base, hh-rlhf-sft, and socially-good-lm (see the loading sketch below).
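If the checkpoints are distributed in a Hugging Face-compatible format, they could be loaded roughly as follows; the model path is a placeholder, not a confirmed hub ID, so check the repository for the actual distribution format.

```python
# Sketch: load a released checkpoint for inference (assumes a Hugging
# Face-compatible checkpoint; the path below is a hypothetical placeholder).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/socially-good-lm"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

prompt = "How should I respond to a rude comment online?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```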
Maintenance & Community
The project accompanies the paper "Training Socially Aligned Language Models in Simulated Human Society" by Liu et al. (2023). The README details no community channels; the repository was last updated roughly two years ago and appears inactive.
Licensing & Compatibility
The repository does not explicitly state a license. The code and data are presented for research purposes, and commercial use would require clarification.
Limitations & Caveats
The project relies heavily on OpenAI's API for simulation agents, incurring costs and external dependencies. The "Stable Alignment" method's generalizability and robustness beyond the described simulations require further validation. The training process requires significant computational resources and specific FSDP configurations.