Online-DPO-R1 by RLHFlow

Codebase for iterative DPO using rule-based rewards

Created 7 months ago
257 stars

Top 98.4% on SourcePulse

View on GitHub
Project Summary

This repository provides a codebase for Iterative Direct Preference Optimization (DPO) with rule-based rewards, aimed at researchers and practitioners fine-tuning large language models (LLMs) for mathematical reasoning. It offers an efficient alternative to PPO-based methods, demonstrating significant gains on math benchmarks with a simpler implementation.

How It Works

The project implements an iterative DPO pipeline: model responses are sampled and scored with predefined rules, and the resulting rewards are used to construct preference pairs for DPO training. The exploration phase uses a best-of-n versus worst-of-n sampling strategy to form these pairs. RAFT training follows a similar pipeline, but only the positively rewarded responses are used for fine-tuning.
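
As a rough illustration of the pairing step, the following is a minimal Python sketch. It assumes a rule-based reward that checks each sampled response's final \boxed{} answer against the reference answer; all names are illustrative and are not the repository's actual API.

```python
# Hypothetical sketch of best-of-n / worst-of-n preference-pair construction
# with a rule-based reward. Illustrative only; not the repository's API.
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # highest-reward response (best-of-n)
    rejected: str  # lowest-reward response (worst-of-n)


def rule_based_reward(response: str, reference_answer: str) -> float:
    """Toy rule: 1.0 if the response's \\boxed{...} answer matches the reference, else 0.0."""
    if "\\boxed{" in response:
        answer = response.rsplit("\\boxed{", 1)[-1].split("}", 1)[0].strip()
    else:
        answer = response.strip()
    return 1.0 if answer == reference_answer.strip() else 0.0


def build_pair(prompt: str, samples: list[str], reference_answer: str) -> PreferencePair | None:
    scored = [(rule_based_reward(s, reference_answer), s) for s in samples]
    best = max(scored, key=lambda x: x[0])
    worst = min(scored, key=lambda x: x[0])
    # Only keep a pair when the rewards differ; equal rewards carry no preference signal.
    return PreferencePair(prompt, chosen=best[1], rejected=worst[1]) if best[0] > worst[0] else None


def raft_filter(samples: list[str], reference_answer: str) -> list[str]:
    """For RAFT-style training, keep only the positively rewarded responses for SFT."""
    return [s for s in samples if rule_based_reward(s, reference_answer) > 0.0]
```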

Quick Start & Requirements

  • Generation Environment: conda create -n vllm python=3.10.9, then pip install vllm==0.5.4 accelerate==0.33.0 deepspeed==0.14.5 transformers==4.48.1 numpy==1.26.4 antlr4-python3-runtime==4.7.2 sympy==1.12 latex2sympy2==1.9.1 word2number==1.1.
  • Training Environment: conda create -n rlhflow python=3.10.9; pip3 install torch==2.1.2 torchvision torchaudio; clone alignment-handbook at commit 27f7dbf00663dab66ad7334afb7a1311fa251f41 and run python -m pip install . inside it; then pip install flash-attn==2.6.3 accelerate==0.33.0 huggingface-hub==0.24.7 transformers==4.42.2 peft==0.7.1 deepspeed==0.15.4 trl==0.9.6 wandb.
  • Prerequisites: CUDA 12.0-12.6.
  • Running: bash run_iter_dpo.sh
  • Documentation: Evaluation scripts are available in the eval_math folder.

Highlighted Details

  • Achieves 51.8% average accuracy on math benchmarks after DPO with SFT warm-up, surpassing Llama-3.1-70B-Instruct and remaining competitive with PPO-based methods.
  • Demonstrates that iterative DPO does not benefit from an additional Negative Log-Likelihood (NLL) loss term (see the sketch after this list).
  • SFT warm-up before DPO is shown to improve model performance.
  • Rule-based DPO and RAFT are presented as efficient and easier-to-implement alternatives to PPO.
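
To make the NLL ablation concrete, here is a minimal sketch of a DPO objective with an optional NLL term on the chosen responses. It assumes summed per-sequence log-probabilities are precomputed; this is a generic formulation, not the repository's actual training code.

```python
# Generic DPO loss with an optional NLL regularizer on the chosen response.
# Assumes log-probabilities are summed over response tokens. Illustrative only.
from typing import Optional

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1, nll_coef: float = 0.0,
             chosen_nll: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Implicit reward margin of chosen over rejected, measured against the reference model.
    margin = (policy_chosen_logps - ref_chosen_logps) - (policy_rejected_logps - ref_rejected_logps)
    loss = -F.logsigmoid(beta * margin).mean()
    # Optional NLL term on the chosen responses; the ablation reported no benefit from it.
    if nll_coef > 0.0 and chosen_nll is not None:
        loss = loss + nll_coef * chosen_nll.mean()
    return loss
```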

Maintenance & Community

The project acknowledges contributions from the vLLM, VeRL, OpenRLHF, Qwen, and Axolotl communities. No community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The README notes that PPO still achieves superior performance (51.8% vs. 50.0% for DPO) on the tested benchmarks, indicating that DPO/RAFT are not yet on par with PPO for this task. It also pins numpy<2.0 (numpy==1.26.4) to avoid dependency issues.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Eric Zhu (Coauthor of AutoGen; Research Scientist at Microsoft Research), and 7 more.

reasoning-gym by open-thought

Top 1.2% · 1k stars
Procedural dataset generator for reasoning models
Created 7 months ago · Updated 3 days ago
Starred by Vincent Weisser (Cofounder of Prime Intellect), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 4 more.

simpleRL-reason by hkust-nlp

Top 0.1% · 4k stars
RL recipe for reasoning ability in models
Created 7 months ago · Updated 1 month ago