RRHF for aligning LLMs to human preferences
This repository introduces RRHF (Rank Responses to align Human Feedback), a simplified method for aligning large language models with human preferences, and Wombat, an open-source chatbot model. It targets researchers and developers seeking more accessible alternatives to complex RLHF techniques like PPO for fine-tuning LLMs.
How It Works
RRHF streamlines human preference alignment by replacing the intricate PPO pipeline with a simpler ranking-based approach. Instead of coordinating separate policy, value, reward, and reference models, RRHF scores candidate responses with the policy's conditional log probabilities and trains it with a ranking loss so that higher-reward responses receive higher scores, making the alignment process as straightforward as conventional fine-tuning. This reduces coding complexity, model count, and hyperparameter tuning, while achieving fluency and alignment scores comparable to PPO.
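The objective can be sketched as a pairwise hinge loss over the policy's length-normalized log probabilities of candidate responses, aligned with the reward ranking, plus a cross-entropy term on the best response. The snippet below is a minimal PyTorch illustration under that reading, assuming candidate responses for a single query have already been scored by a reward model; the function and tensor names are illustrative, not the repository's actual interface.

```python
# Minimal sketch of the RRHF objective for one query with k candidate responses.
# Names and shapes are illustrative, not the repository's actual API.
import torch
import torch.nn.functional as F

def rrhf_loss(logits, labels, mask, rewards):
    """
    logits:  (k, T, V) policy logits for k candidate responses
    labels:  (k, T)    response token ids (next-token targets)
    mask:    (k, T)    1.0 for response tokens, 0.0 for prompt/padding
    rewards: (k,)      reward-model (or human) preference scores
    """
    # Length-normalized conditional log-probability of each response,
    # used as the policy's "score" for ranking.
    log_probs = F.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (k, T)
    scores = (token_lp * mask).sum(-1) / mask.sum(-1)                  # (k,)

    # Ranking loss: for every pair where response j has a higher reward
    # than response i, penalize the policy if it scores i above j.
    score_diff = scores.unsqueeze(1) - scores.unsqueeze(0)        # diff[i, j] = s_i - s_j
    worse_than = rewards.unsqueeze(1) < rewards.unsqueeze(0)      # worse_than[i, j]: r_i < r_j
    rank_loss = (torch.relu(score_diff) * worse_than.float()).sum()

    # Cross-entropy fine-tuning term on the highest-reward response.
    best = rewards.argmax()
    sft_loss = -(token_lp[best] * mask[best]).sum()

    return rank_loss + sft_loss
```

Because both terms are computed from ordinary forward passes of a single policy model, the training loop looks like standard supervised fine-tuning rather than an actor-critic RL setup.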
Quick Start & Requirements
Requires the transformers library from GitHub plus the dependencies listed in requirements.txt. Training requires 8x A100 80GB GPUs, bf16, and FSDP.
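As an illustration of the bf16-plus-FSDP setup (not the repository's actual launch script or hyperparameters), one way to request both through the Hugging Face Trainer is:

```python
# Illustrative only: enabling bf16 and FSDP via Hugging Face TrainingArguments.
# The output directory and batch size below are placeholders, not the repo's values.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./rrhf-checkpoints",   # placeholder path
    bf16=True,                         # bfloat16 mixed-precision training
    fsdp="full_shard auto_wrap",       # shard model states across the 8 GPUs
    per_device_train_batch_size=1,     # placeholder; tune to fit 80GB memory
)
```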
Maintenance & Community
The project is associated with authors from Alibaba and Tsinghua University. Contact emails are provided for suggestions and discussions.
Licensing & Compatibility
The dataset is licensed under CC BY-NC 4.0, restricting commercial use. Models trained on this dataset are also limited to research purposes.
Limitations & Caveats
The current implementation is a proof of concept that relies on a pre-trained reward model to provide synthetic human feedback. Future work aims to incorporate more efficient training methods like LoRA to reduce computational requirements.