Code, data, and trained models for fine-grained RLHF
This repository provides the data, code, and trained models for the paper "Fine-Grained Human Feedback Gives Better Rewards for Language Model Training." It enables researchers and practitioners to implement and experiment with fine-grained Reinforcement Learning from Human Feedback (RLHF) for language model training, specifically demonstrating improvements in long-form question answering and detoxification tasks.
How It Works
The project implements RLHF by training reward models that capture specific aspects of response quality, such as irrelevance, factual accuracy, and completeness, in addition to a holistic preference model. These fine-grained reward models are then used to guide the language model's policy during RLHF training, aiming for more nuanced and targeted improvements compared to standard RLHF.
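As a rough illustration of the idea (a minimal sketch, not the repository's implementation; the aspect names, weights, and function are assumptions), the per-aspect rewards can be combined into one scalar that drives the policy update:

# Hypothetical sketch: combine fine-grained rewards into a single training signal.
# Aspect names and weights are illustrative, not the repository's actual values.
def combined_reward(relevance_r, factuality_r, completeness_r,
                    w_rel=0.3, w_fact=0.5, w_comp=0.2):
    # Each reward model scores one aspect of the generated response; the
    # weighted sum is the scalar reward fed to the RL (e.g. PPO) update.
    return w_rel * relevance_r + w_fact * factuality_r + w_comp * completeness_r

print(combined_reward(relevance_r=0.8, factuality_r=0.6, completeness_r=0.7))  # 0.68

In the paper's setup, some of these rewards are also assigned at a finer granularity (e.g. per segment of the response) rather than once per sequence, which is what makes the feedback "fine-grained."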
Quick Start & Requirements
conda create --name py39 python=3.9
conda activate py39
git clone https://github.com/allenai/FineGrainedRLHF.git
cd FineGrainedRLHF
pip install -e .
python -m spacy download en_core_web_sm
The last command downloads the spaCy en_core_web_sm model. Training scripts mention 80G A100 GPUs.
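As a quick sanity check that the environment is set up (a hypothetical snippet using the standard spaCy API, not part of the repository):

import spacy

# Raises OSError if the en_core_web_sm download step above was skipped.
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp("Fine-grained feedback gives better rewards.")])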
Maintenance & Community
The project is associated with Allen Institute for AI (AI2). No specific community channels (Discord/Slack) or roadmap are explicitly mentioned in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the license.
Limitations & Caveats
RLHF training scripts are currently only provided for the qa-feedback task, with plans to add support for the detoxification task. Users need to manually adjust the mean and std values for sequence-level reward models based on their own trained reward models or the provided mean_std.txt files.
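The adjustment itself is a standard reward standardization; a minimal sketch, assuming mean_std.txt stores a mean and a std as two numbers (the file format and helper below are assumptions, not the repository's code):

# Illustrative only: standardize a sequence-level reward with given statistics.
def normalize_reward(raw_reward, mean, std, eps=1e-8):
    # Centering and scaling keeps rewards from differently calibrated
    # reward models on a comparable scale during RL training.
    return (raw_reward - mean) / (std + eps)

# Assumed file format: two whitespace-separated numbers, e.g. "0.42 1.37".
with open("mean_std.txt") as f:
    mean, std = map(float, f.read().split())

print(normalize_reward(1.25, mean, std))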