RL fine-tuning research
Top 85.3% on sourcepulse
This repository addresses the issue of spurious rewards in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). It provides a framework and experimental results to demonstrate how carefully curated reward signals can improve model performance, particularly in complex reasoning tasks. The project is targeted at researchers and engineers working on LLM alignment and fine-tuning.
How It Works
The project investigates the impact of different reward functions on LLM training, proposing that "spurious rewards" (e.g., superficial formatting or irrelevant content) can mislead the learning process. It leverages the TTRL framework, building upon OpenRLHF, and introduces custom features like asynchronous evaluation. The core idea is to isolate and test specific reward signals, such as mathematical equivalence or correct Python formatting, to understand their contribution to improved reasoning capabilities.
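To make the reward-isolation idea concrete, here is a minimal sketch of the two kinds of signals described above: a correctness reward based on mathematical equivalence, and a "spurious" reward that only checks surface formatting. The function names, answer extraction, and equivalence check are illustrative assumptions, not the repository's actual implementation.

# Illustrative sketch only; the repository's real reward functions and parsing logic may differ.
import re
from fractions import Fraction

def extract_answer(response: str) -> str | None:
    # Take the last \boxed{...} expression as the model's final answer (a common heuristic).
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1] if matches else None

def equivalence_reward(response: str, ground_truth: str) -> float:
    # Pays out only when the extracted answer is mathematically equal to the reference.
    answer = extract_answer(response)
    if answer is None:
        return 0.0
    try:
        return float(Fraction(answer) == Fraction(ground_truth))
    except (ValueError, ZeroDivisionError):
        return float(answer.strip() == ground_truth.strip())  # fallback: exact string match

def spurious_format_reward(response: str) -> float:
    # A spurious signal: rewards the presence of a fenced Python block, regardless of correctness.
    return float("```python" in response)

Training with one of these signals at a time, and comparing the resulting models, is the kind of isolation experiment the project describes.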
Quick Start & Requirements
Create a conda environment, activate it, and install the dependencies:
conda create -n spurious-rewards python=3.10
conda activate spurious-rewards
pip install -r requirements.txt
pip install flash_attn==2.7.0.post2
pip install -e .
A CUDA-capable GPU is required (for flash_attn), and exact reproduction of results requires specific hardware (NVIDIA A100 80GB or H200).
Highlighted Details
Maintenance & Community
The project lists numerous academic affiliations for its authors, indicating strong research backing. Links to Twitter and a Notion site are provided for community engagement and project information.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Exact reproduction of the evaluation results requires specific high-end GPUs (NVIDIA A100 80GB or H200) and matching --shards parameters, because generation can fluctuate with batch size in vLLM. The project appears to be research-oriented, and its readiness for production deployment is not detailed.