RRHF by GanjinZero

RRHF for aligning LLMs to human preferences

created 2 years ago
811 stars

Top 44.5% on sourcepulse

Project Summary

This repository introduces RRHF (Rank Response from Human Feedback), a simplified method for aligning large language models with human preferences, and Wombat, an open-sourced chatbot model. It targets researchers and developers seeking more accessible alternatives to complex RLHF techniques like PPO for fine-tuning LLMs.

How It Works

RRHF streamlines human preference alignment by replacing the intricate PPO algorithm with a simpler ranking-based approach. Instead of coordinating policy, reward, value, and reference models, RRHF scores candidate responses with the model's length-normalized conditional log-probabilities and uses a ranking loss to push those scores into the same order as the human-preference (reward) scores, alongside a standard cross-entropy loss on the best response. This makes alignment as straightforward as conventional fine-tuning and reduces coding complexity, model count, and hyperparameter tuning, while achieving fluency and alignment scores comparable to PPO.
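For intuition, here is a minimal PyTorch sketch of the ranking objective described above, assuming length-normalized log-probabilities and reward scores for each candidate response have already been computed; the function name and tensor shapes are illustrative, not the repository's API.

```python
# Minimal sketch of an RRHF-style objective (illustrative, not the repo's code).
# `logprobs`: length-normalized log-probabilities the policy assigns to k
# candidate responses; `rewards`: their preference/reward scores;
# `sft_nll`: cross-entropy (NLL) of the highest-reward response.
import torch

def rrhf_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
              sft_nll: torch.Tensor) -> torch.Tensor:
    # diff[i, j] = p_i - p_j: how much higher the policy scores response i than j.
    diff = logprobs.unsqueeze(1) - logprobs.unsqueeze(0)
    # prefer[i, j] is True where response j has a higher reward than response i.
    prefer = rewards.unsqueeze(1) < rewards.unsqueeze(0)
    # Penalize pairs the policy ranks in the wrong order.
    rank_loss = torch.clamp(diff, min=0)[prefer].sum()
    # Keep fluency with a standard cross-entropy term on the best response.
    return rank_loss + sft_nll

# Toy usage with three candidate responses:
loss = rrhf_loss(torch.tensor([-1.2, -0.8, -2.0]),   # policy log-prob scores
                 torch.tensor([0.1, 0.9, -0.3]),     # reward scores
                 sft_nll=torch.tensor(0.8))
```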

Quick Start & Requirements

  • Installation: Requires Python 3.8 and PyTorch 1.13.0+cu116. Install Hugging Face's transformers from GitHub.
  • Prerequisites: CUDA 11.6 and the dependencies in requirements.txt. Training requires 8x A100 80GB GPUs, bf16, and FSDP.
  • Data: Uses Anthropic's HH dataset or the provided generated data (see the sketch after this list).
  • Links: Paper, Wombat Weights, Alpaca.
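
To make the data requirement above concrete, the sketch below shows what one ranking-style training record could look like: a prompt, several candidate responses, and a reward score per response. The field names (query, responses, scores) and the file name are assumptions for illustration, not the repository's documented schema.

```python
import json

# Hypothetical ranking-style training record; field names are illustrative
# and may differ from the repository's actual data files.
record = {
    "query": "Human: How can I start learning guitar?\n\nAssistant:",
    "responses": [
        " Begin with basic open chords and practice a little every day.",
        " Just buy the most expensive guitar you can find.",
        " Take structured lessons and learn a few simple songs first.",
    ],
    "scores": [0.62, -0.45, 0.71],  # e.g., reward-model scores used for ranking
}

# Write one record per line (JSON Lines) to a sample file.
with open("train_sample.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```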

Highlighted Details

  • RRHF achieves perplexity (PPL) and reward scores comparable to PPO on LLaMA and Alpaca models using the HH dataset.
  • Wombat models (Wombat-7B, Wombat-7B-GPT4) are released, built upon Alpaca-7B and aligned using RRHF with various data sources.
  • Preliminary experiments show Wombat-7B outperforming Alpaca-7B on a Vicuna test set.
  • Math and programming skills are noted as weak points for all LLaMA-7B based models.

Maintenance & Community

The project is associated with authors from Alibaba and Tsinghua University. Contact emails are provided for suggestions and discussions.

Licensing & Compatibility

The dataset is licensed under CC BY-NC 4.0, which prohibits commercial use; models trained on it are likewise limited to research purposes.

Limitations & Caveats

The current implementation relies on a pre-trained reward model to provide synthetic human feedback and serves as a proof of concept. Future work aims to incorporate more efficient training methods such as LoRA to reduce computational requirements.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 90 days
