Code, data, and trained models for fine-grained RLHF
This repository provides the data, code, and trained models for the paper "Fine-Grained Human Feedback Gives Better Rewards for Language Model Training." It enables researchers and practitioners to implement and experiment with fine-grained Reinforcement Learning from Human Feedback (RLHF) for language model training, specifically demonstrating improvements in long-form question answering and detoxification tasks.
How It Works
The project implements RLHF by training reward models that capture specific aspects of response quality, such as irrelevance, factual accuracy, and completeness, in addition to a holistic preference model. These fine-grained reward models are then used to guide the language model's policy during RLHF training, aiming for more nuanced and targeted improvements compared to standard RLHF.
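As a rough illustration of the idea (a minimal sketch, not the repository's implementation; the aspect names, weights, and function are assumptions), the per-aspect rewards can be combined into one scalar that drives the policy update:

# Hypothetical sketch: combine fine-grained rewards into a single training signal.
# Aspect names and weights are illustrative, not the repository's actual values.
def combined_reward(relevance_r, factuality_r, completeness_r,
                    w_rel=0.3, w_fact=0.5, w_comp=0.2):
    # Each reward model scores one aspect of the generated response; the
    # weighted sum is the scalar reward fed to the RL (e.g. PPO) update.
    return w_rel * relevance_r + w_fact * factuality_r + w_comp * completeness_r

print(combined_reward(relevance_r=0.8, factuality_r=0.6, completeness_r=0.7))  # 0.68

In the paper's setup, some of these rewards are also assigned at a finer granularity (e.g. per segment of the response) rather than once per sequence, which is what makes the feedback "fine-grained."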
Quick Start & Requirements
conda create --name py39 python=3.9
conda activate py39
git clone https://github.com/allenai/FineGrainedRLHF.git
cd FineGrainedRLHF
pip install -e .
python -m spacy download en_core_web_sm
The last command downloads the spaCy en_core_web_sm model. Training scripts mention 80G A100 GPUs.
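As a quick sanity check that the environment is set up (a hypothetical snippet using the standard spaCy API, not part of the repository):

import spacy

# Raises OSError if the en_core_web_sm download step above was skipped.
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp("Fine-grained feedback gives better rewards.")])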
Maintenance & Community
The project is associated with Allen Institute for AI (AI2). No specific community channels (Discord/Slack) or roadmap are explicitly mentioned in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification of the license.
Limitations & Caveats
RLHF training scripts are currently only provided for the qa-feedback task, with plans to add support for the detoxification task. Users need to manually adjust the mean and std values for sequence-level reward models based on their own trained reward models or the provided mean_std.txt files.
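The adjustment itself is a standard reward standardization; a minimal sketch, assuming mean_std.txt stores a mean and a std as two numbers (the file format and helper below are assumptions, not the repository's code):

# Illustrative only: standardize a sequence-level reward with given statistics.
def normalize_reward(raw_reward, mean, std, eps=1e-8):
    # Centering and scaling keeps rewards from differently calibrated
    # reward models on a comparable scale during RL training.
    return (raw_reward - mean) / (std + eps)

# Assumed file format: two whitespace-separated numbers, e.g. "0.42 1.37".
with open("mean_std.txt") as f:
    mean, std = map(float, f.read().split())

print(normalize_reward(1.25, mean, std))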