FineGrainedRLHF by allenai

Research paper for fine-grained RLHF

created 2 years ago
278 stars

Top 94.3% on sourcepulse

Project Summary

This repository provides the data, code, and trained models for the paper "Fine-Grained Human Feedback Gives Better Rewards for Language Model Training." It enables researchers and practitioners to implement and experiment with fine-grained Reinforcement Learning from Human Feedback (RLHF) for language model training, specifically demonstrating improvements in long-form question answering and detoxification tasks.

How It Works

The project implements RLHF by training reward models that capture specific aspects of response quality, such as irrelevance, factual accuracy, and completeness, in addition to a holistic preference model. These fine-grained reward models are then used to guide the language model's policy during RLHF training, aiming for more nuanced and targeted improvements compared to standard RLHF.
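In practice, the scalar reward used for policy optimization is a weighted combination of the per-aspect reward model outputs. Below is a minimal illustrative sketch of that idea in Python; the aspect names, weights, and function names are hypothetical and do not reproduce the repository's actual reward models or configuration.

    # Illustrative sketch: combining fine-grained rewards into one scalar for RL.
    # Aspect names and weights are hypothetical, not the repository's settings.
    from dataclasses import dataclass

    @dataclass
    class FineGrainedReward:
        relevance: float      # output of the (ir)relevance reward model
        factuality: float     # output of the factual-correctness reward model
        completeness: float   # output of the information-completeness reward model

    def combine_rewards(r: FineGrainedReward,
                        w_rel: float = 0.3,
                        w_fact: float = 0.5,
                        w_comp: float = 0.2) -> float:
        """Weighted sum of per-aspect rewards used as the scalar RL reward."""
        return w_rel * r.relevance + w_fact * r.factuality + w_comp * r.completeness

    # Example: a relevant, fairly complete response with a factual error.
    reward = combine_rewards(FineGrainedReward(relevance=0.8, factuality=-0.4, completeness=0.6))
    print(f"combined reward: {reward:.2f}")

Weighting the aspects separately is what lets the policy be steered toward specific failure modes (e.g. penalizing factual errors more heavily) rather than optimizing a single holistic preference score.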

Quick Start & Requirements

  • Install:
    conda create --name py39 python=3.9
    conda activate py39
    git clone https://github.com/allenai/FineGrainedRLHF.git
    cd FineGrainedRLHF
    pip install -e .
    python -m spacy download en_core_web_sm
    
  • Prerequisites: Python 3.9 and the spaCy en_core_web_sm model (a sanity-check sketch follows this list); the training scripts reference 80GB A100 GPUs.
  • Resources: Downloadable trained models are available.
  • Docs: Paper, Feedback Interfaces
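As a quick sanity check after installation (not a script from the repository, just a way to confirm the spaCy model downloaded above is usable):

    # Optional sanity check: verify the spaCy model required for preprocessing loads.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Fine-grained feedback assigns rewards to individual spans.")
    print([token.text for token in doc[:5]])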

Highlighted Details

  • Implements fine-grained reward modeling for irrelevance, repetition, incoherence, factual correctness, and information completeness.
  • Provides training scripts for both holistic and fine-grained RLHF.
  • Includes feedback collection interfaces for fine-grained and preference data.
  • Offers pre-trained SFT, reward, and RLHF models for convenience.

Maintenance & Community

The project is associated with Allen Institute for AI (AI2). No specific community channels (Discord/Slack) or roadmap are explicitly mentioned in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Whether it can be used commercially or linked into closed-source projects therefore depends on clarifying the license with the maintainers.

Limitations & Caveats

RLHF training scripts are currently only provided for the qa-feedback task, with plans to add support for the detoxification task. Users need to manually adjust mean and std values for sequence-level reward models based on their own trained reward models or the provided mean_std.txt files.
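For orientation, adjusting the mean and std amounts to z-score normalization of the raw sequence-level reward. The sketch below is illustrative only, with made-up statistics standing in for the values from mean_std.txt or your own reward models.

    # Illustrative sketch of sequence-level reward normalization.
    # The statistics here are made up; substitute the values from mean_std.txt
    # or from your own trained reward models.
    def normalize_reward(raw_reward: float, mean: float, std: float, eps: float = 1e-8) -> float:
        """Standardize a raw sequence-level reward with the reward model's statistics."""
        return (raw_reward - mean) / (std + eps)

    print(normalize_reward(raw_reward=1.7, mean=0.9, std=0.5))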

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days
