RRHF by GanjinZero

RRHF for aligning LLMs to human preferences

created 2 years ago
811 stars

Top 44.5% on sourcepulse

Project Summary

This repository introduces RRHF (Rank Response from Human Feedback), a simplified method for aligning large language models with human preferences, and Wombat, an open-sourced chatbot model. It targets researchers and developers seeking more accessible alternatives to complex RLHF techniques like PPO for fine-tuning LLMs.

How It Works

RRHF streamlines human preference alignment by replacing the intricate PPO algorithm with a simpler ranking-based approach. Instead of coordinating policy, reward, value, and reference models, RRHF scores candidate responses with the model's length-normalized conditional log-probabilities and uses a ranking loss to push those scores into the same order as the human-preference (reward) scores, alongside a standard cross-entropy loss on the best response. This makes alignment as straightforward as conventional fine-tuning and reduces coding complexity, model count, and hyperparameter tuning, while achieving fluency and alignment scores comparable to PPO.
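For intuition, here is a minimal PyTorch sketch of the ranking objective described above, assuming length-normalized log-probabilities and reward scores for each candidate response have already been computed; the function name and tensor shapes are illustrative, not the repository's API.

```python
# Minimal sketch of an RRHF-style objective (illustrative, not the repo's code).
# `logprobs`: length-normalized log-probabilities the policy assigns to k
# candidate responses; `rewards`: their preference/reward scores;
# `sft_nll`: cross-entropy (NLL) of the highest-reward response.
import torch

def rrhf_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
              sft_nll: torch.Tensor) -> torch.Tensor:
    # diff[i, j] = p_i - p_j: how much higher the policy scores response i than j.
    diff = logprobs.unsqueeze(1) - logprobs.unsqueeze(0)
    # prefer[i, j] is True where response j has a higher reward than response i.
    prefer = rewards.unsqueeze(1) < rewards.unsqueeze(0)
    # Penalize pairs the policy ranks in the wrong order.
    rank_loss = torch.clamp(diff, min=0)[prefer].sum()
    # Keep fluency with a standard cross-entropy term on the best response.
    return rank_loss + sft_nll

# Toy usage with three candidate responses:
loss = rrhf_loss(torch.tensor([-1.2, -0.8, -2.0]),   # policy log-prob scores
                 torch.tensor([0.1, 0.9, -0.3]),     # reward scores
                 sft_nll=torch.tensor(0.8))
```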

Quick Start & Requirements

  • Installation: Requires Python 3.8 and PyTorch 1.13.0+cu116. Install Hugging Face's transformers from GitHub.
  • Prerequisites: CUDA 11.6 and the dependencies in requirements.txt. Training requires 8x A100 80GB GPUs, bf16, and FSDP.
  • Data: Uses Anthropic's HH dataset or the provided generated data (see the sketch after this list).
  • Links: Paper, Wombat Weights, Alpaca.
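
To make the data requirement above concrete, the sketch below shows what one ranking-style training record could look like: a prompt, several candidate responses, and a reward score per response. The field names (query, responses, scores) and the file name are assumptions for illustration, not the repository's documented schema.

```python
import json

# Hypothetical ranking-style training record; field names are illustrative
# and may differ from the repository's actual data files.
record = {
    "query": "Human: How can I start learning guitar?\n\nAssistant:",
    "responses": [
        " Begin with basic open chords and practice a little every day.",
        " Just buy the most expensive guitar you can find.",
        " Take structured lessons and learn a few simple songs first.",
    ],
    "scores": [0.62, -0.45, 0.71],  # e.g., reward-model scores used for ranking
}

# Write one record per line (JSON Lines) to a sample file.
with open("train_sample.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```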

Highlighted Details

  • RRHF achieves perplexity (PPL) and reward scores comparable to PPO on LLaMA and Alpaca models using the HH dataset.
  • Wombat models (Wombat-7B, Wombat-7B-GPT4) are released, built upon Alpaca-7B and aligned using RRHF with various data sources.
  • Preliminary experiments show Wombat-7B outperforming Alpaca-7B on a Vicuna test set.
  • Math and programming skills are noted as weak points for all LLaMA-7B based models.

Maintenance & Community

The project is associated with authors from Alibaba and Tsinghua University. Contact emails are provided for suggestions and discussions.

Licensing & Compatibility

The dataset is licensed under CC BY-NC 4.0, which prohibits commercial use; models trained on it are likewise limited to research purposes.

Limitations & Caveats

The current implementation relies on a pre-trained reward model to provide synthetic human feedback and serves as a proof of concept. Future work aims to incorporate more efficient training methods such as LoRA to reduce computational requirements.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 90 days
