RL fine-tuning research
Top 85.3% on sourcepulse
This repository addresses the issue of spurious rewards in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). It provides a framework and experimental results to demonstrate how carefully curated reward signals can improve model performance, particularly in complex reasoning tasks. The project is targeted at researchers and engineers working on LLM alignment and fine-tuning.
How It Works
The project investigates the impact of different reward functions on LLM training, proposing that "spurious rewards" (e.g., superficial formatting or irrelevant content) can mislead the learning process. It leverages the TTRL framework, building upon OpenRLHF, and introduces custom features like asynchronous evaluation. The core idea is to isolate and test specific reward signals, such as mathematical equivalence or correct Python formatting, to understand their contribution to improved reasoning capabilities.
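To make the reward-isolation idea concrete, here is a minimal sketch of the two kinds of signals described above: a correctness reward based on mathematical equivalence, and a "spurious" reward that only checks surface formatting. The function names, answer extraction, and equivalence check are illustrative assumptions, not the repository's actual implementation.

# Illustrative sketch only; the repository's real reward functions and parsing logic may differ.
import re
from fractions import Fraction

def extract_answer(response: str) -> str | None:
    # Take the last \boxed{...} expression as the model's final answer (a common heuristic).
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1] if matches else None

def equivalence_reward(response: str, ground_truth: str) -> float:
    # Pays out only when the extracted answer is mathematically equal to the reference.
    answer = extract_answer(response)
    if answer is None:
        return 0.0
    try:
        return float(Fraction(answer) == Fraction(ground_truth))
    except (ValueError, ZeroDivisionError):
        return float(answer.strip() == ground_truth.strip())  # fallback: exact string match

def spurious_format_reward(response: str) -> float:
    # A spurious signal: rewards the presence of a fenced Python block, regardless of correctness.
    return float("```python" in response)

Training with one of these signals at a time, and comparing the resulting models, is the kind of isolation experiment the project describes.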
Quick Start & Requirements
Create a conda environment, activate it, and install the dependencies:
conda create -n spurious-rewards python=3.10
conda activate spurious-rewards
pip install -r requirements.txt
pip install flash_attn==2.7.0.post2
pip install -e .
A CUDA-capable GPU is required (for flash_attn), and exact reproduction of results requires specific hardware (NVIDIA A100 80GB or H200).
Highlighted Details
Maintenance & Community
The project lists numerous academic affiliations for its authors, indicating strong research backing. Links to Twitter and a Notion site are provided for community engagement and project information.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Exact reproduction of the evaluation results requires specific high-end GPUs (NVIDIA A100 80GB or H200) and matching --shards parameters, because generation can fluctuate with batch size in vLLM. The project appears to be research-oriented, and its readiness for production deployment is not detailed.