Online-DPO-R1 by RLHFlow

Codebase for iterative DPO using rule-based rewards

Created 7 months ago
257 stars

Top 98.4% on SourcePulse

View on GitHub
Project Summary

This repository provides a codebase for Iterative Direct Preference Optimization (DPO) with rule-based rewards, aimed at researchers and practitioners fine-tuning large language models (LLMs) for mathematical reasoning. It offers an efficient alternative to PPO-based methods, demonstrating significant gains on math benchmarks with a simpler implementation.

How It Works

The project implements an iterative DPO pipeline: model responses are sampled and scored with predefined rules, and the resulting rewards are used to construct preference pairs for DPO training. The exploration phase uses a best-of-n versus worst-of-n sampling strategy to form these pairs. RAFT training follows a similar pipeline, but only the positively rewarded responses are used for fine-tuning.
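
As a rough illustration of the pairing step, the following is a minimal Python sketch. It assumes a rule-based reward that checks each sampled response's final \boxed{} answer against the reference answer; all names are illustrative and are not the repository's actual API.

```python
# Hypothetical sketch of best-of-n / worst-of-n preference-pair construction
# with a rule-based reward. Illustrative only; not the repository's API.
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # highest-reward response (best-of-n)
    rejected: str  # lowest-reward response (worst-of-n)


def rule_based_reward(response: str, reference_answer: str) -> float:
    """Toy rule: 1.0 if the response's \\boxed{...} answer matches the reference, else 0.0."""
    if "\\boxed{" in response:
        answer = response.rsplit("\\boxed{", 1)[-1].split("}", 1)[0].strip()
    else:
        answer = response.strip()
    return 1.0 if answer == reference_answer.strip() else 0.0


def build_pair(prompt: str, samples: list[str], reference_answer: str) -> PreferencePair | None:
    scored = [(rule_based_reward(s, reference_answer), s) for s in samples]
    best = max(scored, key=lambda x: x[0])
    worst = min(scored, key=lambda x: x[0])
    # Only keep a pair when the rewards differ; equal rewards carry no preference signal.
    return PreferencePair(prompt, chosen=best[1], rejected=worst[1]) if best[0] > worst[0] else None


def raft_filter(samples: list[str], reference_answer: str) -> list[str]:
    """For RAFT-style training, keep only the positively rewarded responses for SFT."""
    return [s for s in samples if rule_based_reward(s, reference_answer) > 0.0]
```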

Quick Start & Requirements

  • Generation Environment: conda create -n vllm python=3.10.9, then pip install vllm==0.5.4 accelerate==0.33.0 deepspeed==0.14.5 transformers==4.48.1 numpy==1.26.4 antlr4-python3-runtime==4.7.2 sympy==1.12 latex2sympy2==1.9.1 word2number==1.1.
  • Training Environment: conda create -n rlhflow python=3.10.9; pip3 install torch==2.1.2 torchvision torchaudio; clone alignment-handbook at commit 27f7dbf00663dab66ad7334afb7a1311fa251f41 and run python -m pip install . inside it; then pip install flash-attn==2.6.3 accelerate==0.33.0 huggingface-hub==0.24.7 transformers==4.42.2 peft==0.7.1 deepspeed==0.15.4 trl==0.9.6 wandb.
  • Prerequisites: CUDA 12.0-12.6.
  • Running: bash run_iter_dpo.sh
  • Documentation: Evaluation scripts are available in the eval_math folder.

Highlighted Details

  • Achieves 51.8% average accuracy on math benchmarks after DPO with SFT warm-up, surpassing Llama-3.1-70B-Instruct and remaining competitive with PPO-based methods.
  • Demonstrates that iterative DPO does not benefit from an additional Negative Log-Likelihood (NLL) loss term (see the sketch after this list).
  • SFT warm-up before DPO is shown to improve model performance.
  • Rule-based DPO and RAFT are presented as efficient and easier-to-implement alternatives to PPO.
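
To make the NLL ablation concrete, here is a minimal sketch of a DPO objective with an optional NLL term on the chosen responses. It assumes summed per-sequence log-probabilities are precomputed; this is a generic formulation, not the repository's actual training code.

```python
# Generic DPO loss with an optional NLL regularizer on the chosen response.
# Assumes log-probabilities are summed over response tokens. Illustrative only.
from typing import Optional

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1, nll_coef: float = 0.0,
             chosen_nll: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Implicit reward margin of chosen over rejected, measured against the reference model.
    margin = (policy_chosen_logps - ref_chosen_logps) - (policy_rejected_logps - ref_rejected_logps)
    loss = -F.logsigmoid(beta * margin).mean()
    # Optional NLL term on the chosen responses; the ablation reported no benefit from it.
    if nll_coef > 0.0 and chosen_nll is not None:
        loss = loss + nll_coef * chosen_nll.mean()
    return loss
```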

Maintenance & Community

The project acknowledges contributions from the vLLM, VeRL, OpenRLHF, Qwen, and Axolotl communities. No community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The README notes that PPO still achieves superior performance (51.8% vs. 50.0% for DPO) on the tested benchmarks, indicating that DPO/RAFT are not yet on par with PPO for this task. It also pins numpy<2.0 (numpy==1.26.4) to avoid dependency issues.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Eric Zhu (Coauthor of AutoGen; Research Scientist at Microsoft Research), and 7 more.

reasoning-gym by open-thought

Top 1.2% · 1k stars
Procedural dataset generator for reasoning models
Created 7 months ago · Updated 3 days ago
Starred by Vincent Weisser (Cofounder of Prime Intellect), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 4 more.

simpleRL-reason by hkust-nlp

Top 0.1% · 4k stars
RL recipe for reasoning ability in models
Created 7 months ago · Updated 1 month ago