Minimal-RL by RLHFlow

LLM fine-tuning for mathematical reasoning via RL

Created 9 months ago
256 stars

Top 98.5% on SourcePulse

Project Summary

This project investigates reinforcement learning (RL) algorithms for fine-tuning large language models (LLMs) on mathematical reasoning tasks. It compares RAFT++ (rejection sampling), Vanilla Reinforce, and GRPO to understand which factors drive successful LLM fine-tuning. The research offers insights into algorithm performance, convergence, and exploration, and introduces Reinforce-rej, a new, more KL-efficient variant.

How It Works

The project revisits and strengthens RL algorithms for LLM post-training. RAFT++ is rejection sampling (training only on responses judged correct) augmented with importance sampling and clipping. Vanilla Reinforce is a simplified policy-gradient algorithm without a critic. GRPO, a Reinforce variant, samples multiple responses per prompt and normalizes rewards within each group. Key findings show that RAFT++ is competitive and converges faster early in training, but that negative samples remain vital for exploration and for preventing distributional collapse, a benefit RAFT++ forgoes by training only on accepted (positive) samples. GRPO's advantage over standard Reinforce is attributed to its implicit filtering of prompts whose sampled responses are all incorrect. A sketch of the two update ideas follows.
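The following is a minimal sketch of the two update ideas described above, assuming binary correctness rewards and PyTorch; the function names (raft_pp_mask, grpo_advantages, clipped_pg_loss) are illustrative and are not the repository's actual API.

    import torch

    def raft_pp_mask(rewards: torch.Tensor) -> torch.Tensor:
        """RAFT++-style rejection sampling: keep only responses whose reward
        indicates a correct answer; rejected samples get zero weight."""
        return (rewards > 0).float()

    def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        """GRPO-style advantages: normalize rewards within each prompt's group
        of sampled responses (mean-centered, divided by the group std).
        Groups where every response has the same reward (all right or all
        wrong) receive zero advantage, i.e. those prompts are implicitly filtered."""
        mean = rewards.mean(dim=-1, keepdim=True)
        std = rewards.std(dim=-1, keepdim=True)
        return (rewards - mean) / (std + eps)

    def clipped_pg_loss(logp_new, logp_old, advantages, clip=0.2):
        """Clipped importance-sampling policy-gradient objective; a RAFT++-style
        update can be sketched by plugging the rejection mask in as the advantage."""
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * advantages
        return -torch.min(unclipped, clipped).mean()

    # Toy example: 2 prompts, 4 sampled responses each, binary correctness rewards.
    rewards = torch.tensor([[1., 0., 0., 1.],    # mixed group -> useful signal
                            [0., 0., 0., 0.]])   # all wrong -> zero GRPO advantage
    print(raft_pp_mask(rewards))
    print(grpo_advantages(rewards))

    logp_old = torch.randn_like(rewards)
    logp_new = logp_old + 0.01 * torch.randn_like(rewards)
    print(clipped_pg_loss(logp_new, logp_old, raft_pp_mask(rewards)))

In the toy example, the second prompt's all-wrong group gets zero GRPO advantage, which illustrates the implicit prompt filtering credited with GRPO's edge over standard Reinforce.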

Quick Start & Requirements

Environment setup requires a Python virtual environment (e.g., python -m venv ~/.python/raftpp) or Conda. Dependencies include PyTorch 2.4.0 with CUDA 12.4 (torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124), flash-attn, and additional project dependencies.
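A minimal setup sketch, assuming a bash shell and the venv path mentioned above; the flash-attn install flag and any further dependencies should be checked against the repository's own requirements.

    python -m venv ~/.python/raftpp
    source ~/.python/raftpp/bin/activate
    pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
    # flash-attn's docs typically recommend installing without build isolation
    pip install flash-attn --no-build-isolation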

Health Check

Last Commit: 8 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 5 stars in the last 30 days
