Codebase for iterative DPO using rule-based rewards
This repository provides a codebase for implementing Iterative Direct Preference Optimization (DPO) with rule-based rewards, targeting researchers and practitioners fine-tuning large language models (LLMs) for mathematical reasoning. It offers an efficient alternative to PPO-based methods, demonstrating significant gains on several math benchmarks with a simpler implementation.
How It Works
The project employs an iterative DPO pipeline in which model responses are sampled and assigned rewards according to predefined rules. These rewards are then used to construct preference pairs for DPO training: during the exploration phase, the best-of-n response is paired against the worst-of-n response for each prompt. RAFT training follows a similar pipeline but uses only the positively rewarded responses for fine-tuning.
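To make the exploration step concrete, the sketch below shows how best-of-n versus worst-of-n preference pairs (and RAFT's positive-only data) could be assembled from rule-based rewards. This is a minimal illustration, not the repository's code; `sample_fn` and `reward_fn` are hypothetical placeholders for a sampling backend (e.g., a vLLM wrapper) and a rule-based checker.

```python
from typing import Callable, Dict, List


def build_training_data(
    prompts: List[str],
    sample_fn: Callable[[str, int], List[str]],  # hypothetical: returns n sampled responses for a prompt
    reward_fn: Callable[[str, str], float],      # hypothetical rule-based reward, e.g. 1.0 if correct else 0.0
    n: int = 8,
) -> Dict[str, List[dict]]:
    """Build DPO pairs (best-of-n vs. worst-of-n) and RAFT data (positively rewarded responses only)."""
    dpo_pairs: List[dict] = []
    raft_data: List[dict] = []
    for prompt in prompts:
        responses = sample_fn(prompt, n)
        rewards = [reward_fn(prompt, r) for r in responses]
        best = max(range(len(responses)), key=lambda i: rewards[i])
        worst = min(range(len(responses)), key=lambda i: rewards[i])
        # A preference pair is only informative when the rewards actually differ.
        if rewards[best] > rewards[worst]:
            dpo_pairs.append(
                {"prompt": prompt, "chosen": responses[best], "rejected": responses[worst]}
            )
        # RAFT keeps only the positively rewarded responses for supervised fine-tuning.
        raft_data.extend(
            {"prompt": prompt, "response": r}
            for r, rew in zip(responses, rewards)
            if rew > 0
        )
    return {"dpo_pairs": dpo_pairs, "raft_data": raft_data}
```

In an iterative setup, this data-construction step and the subsequent DPO (or SFT) training step would simply alternate across rounds.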
Quick Start & Requirements
Setup uses two conda environments:

- `vllm` environment: `conda create -n vllm python=3.10.9`, then `pip install vllm==0.5.4 accelerate==0.33.0 deepspeed==0.14.5 transformers==4.48.1 numpy==1.26.4 antlr4-python3-runtime==4.7.2 sympy==1.12 latex2sympy2==1.9.1 word2number==1.1`.
- `rlhflow` environment: `conda create -n rlhflow python=3.10.9`, then `pip3 install torch==2.1.2 torchvision torchaudio`, run `python -m pip install .` from within the cloned alignment-handbook directory (specific commit 27f7dbf00663dab66ad7334afb7a1311fa251f41), and finally `pip install flash-attn==2.6.3 accelerate==0.33.0 huggingface-hub==0.24.7 transformers==4.42.2 peft==0.7.1 deepspeed==0.15.4 trl==0.9.6 wandb`.

Launch the iterative DPO pipeline with `bash run_iter_dpo.sh`. Evaluation scripts are located in the `eval_math` folder.
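The pinned math packages in the `vllm` environment (sympy, latex2sympy2, antlr4-python3-runtime, word2number) are the kind of stack typically used for rule-based answer checking. As an illustration only, and assuming the final answer is reported in a `\boxed{...}` span, a minimal reward of this kind might look like the following; the repository's actual checker lives in `eval_math` and is likely more robust.

```python
import re

import sympy


def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} span in a response (nested braces not handled)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None


def math_rule_reward(response: str, gold_answer: str) -> float:
    """Return 1.0 if the extracted answer is symbolically equal to the gold answer, else 0.0."""
    pred = extract_boxed(response)
    if pred is None:
        return 0.0
    try:
        # Symbolic equivalence: the difference simplifies to zero.
        diff = sympy.simplify(sympy.sympify(pred) - sympy.sympify(gold_answer))
        return 1.0 if diff == 0 else 0.0
    except Exception:
        # Be permissive: any parsing or simplification failure counts as incorrect.
        return 0.0
```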
Highlighted Details
Maintenance & Community
The project acknowledges contributions from the vLLM, VeRL, OpenRLHF, Qwen, and Axolotl communities. No community links (Discord/Slack) or roadmap are provided in the README, and the repository is currently marked inactive, with its last update about five months ago.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not detailed.
Limitations & Caveats
The README notes that PPO still achieves superior performance (51.8% vs. 50.0% for DPO) on the tested benchmarks, indicating that DPO/RAFT are not yet on par with PPO for this task. A pinned numpy<2.0 dependency is also highlighted to avoid compatibility issues.