Reference implementation for Direct Preference Optimization (DPO)
This repository provides a reference implementation for Direct Preference Optimization (DPO) and its variants, Conservative DPO and IPO, for training language models from preference data. It is designed for researchers and practitioners looking to fine-tune causal HuggingFace models using preference datasets. The implementation allows for easy integration of custom datasets and models, offering a flexible framework for preference-based alignment.
How It Works
The DPO pipeline has two stages: supervised fine-tuning (SFT) on the dataset of interest, followed by preference learning on pairwise preference data. The core idea is to optimize the language model policy directly with a loss derived from the preference data, bypassing the need for an explicit reward model. This simplifies the training pipeline and can lead to more stable and efficient learning.
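As a minimal sketch of the objective (function and argument names here are illustrative, not the repository's exact API), the DPO loss compares the policy's preference margin against the frozen reference model's margin on sequence log-probabilities:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    """Illustrative standard DPO objective on per-example sequence log-probs."""
    # Log-ratio of chosen vs. rejected under the policy and the frozen reference.
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    reference_logratio = reference_chosen_logps - reference_rejected_logps
    # Maximize the implicit reward margin of the chosen response over the rejected one.
    return -F.logsigmoid(beta * (policy_logratio - reference_logratio)).mean()
```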
Quick Start & Requirements
pip install -r requirements.txt
See config/model for model configurations and preference_datasets.py for dataset integration.
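To make the quick start concrete, the sketch below assumes the repository's Hydra-style overrides on train.py; treat the model name, dataset, batch sizes, and checkpoint path as placeholders to verify against the config files before running.

```sh
# Stage 1: supervised fine-tuning (SFT). Model, dataset, and batch sizes are illustrative.
python -u train.py model=pythia69 datasets=[hh] loss=sft exp_name=hh_sft_example \
    gradient_accumulation_steps=2 batch_size=64 eval_batch_size=32 \
    trainer=FSDPTrainer sample_during_eval=false

# Stage 2: DPO starting from the SFT checkpoint (checkpoint path is a placeholder).
python -u train.py model=pythia69 datasets=[hh] loss=dpo loss.beta=0.1 \
    model.archive=/path/to/sft/checkpoint/policy.pt exp_name=hh_dpo_example \
    gradient_accumulation_steps=2 batch_size=32 eval_batch_size=32 \
    trainer=FSDPTrainer sample_during_eval=false
```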
Highlighted Details
- Supports standard DPO, conservative DPO (via loss.label_smoothing), and IPO (via loss=ipo); see the sketch after this list for how the variants differ.
- Three trainer classes: BasicTrainer (naive multi-GPU), FSDPTrainer (PyTorch FSDP), and an experimental TensorParallelTrainer.
- Mixed-precision training (bfloat16, float16) and activation checkpointing for performance optimization with FSDP.
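The hedged sketch below (argument names are illustrative; check the repository's loss code for the real interface) shows how label smoothing and the IPO option change the objective applied to the same policy/reference log-ratio margin:

```python
import torch
import torch.nn.functional as F

def preference_loss(policy_logratio, reference_logratio,
                    beta=0.1, label_smoothing=0.0, ipo=False):
    """Illustrative comparison of the DPO, conservative DPO, and IPO variants."""
    margin = policy_logratio - reference_logratio
    if ipo:
        # IPO: regress the margin toward 1 / (2 * beta) instead of using a sigmoid.
        return ((margin - 1 / (2 * beta)) ** 2).mean()
    # Conservative DPO: label_smoothing models the chance that a preference label
    # is flipped; label_smoothing = 0 recovers standard DPO.
    return (-F.logsigmoid(beta * margin) * (1 - label_smoothing)
            - F.logsigmoid(-beta * margin) * label_smoothing).mean()
```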
Maintenance & Community
The project is associated with the authors of the DPO paper. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Users should verify licensing for commercial use or integration into closed-source projects.
Limitations & Caveats
Sampling during evaluation can be slow with FSDPTrainer and TensorParallelTrainer, and TensorParallelTrainer is noted as experimental. The README suggests setting ulimit -n 64000 when using FSDPTrainer and recommends sample_during_eval=false for performance.
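A minimal sketch of applying both recommendations together (the trailing overrides are the illustrative ones from the quick-start sketch above):

```sh
# Raise the open-file limit in the current shell before launching FSDPTrainer.
ulimit -n 64000

# Disable sampling during evaluation to avoid the slow sampling path.
python -u train.py trainer=FSDPTrainer sample_during_eval=false \
    model=pythia69 datasets=[hh] loss=dpo loss.beta=0.1 exp_name=hh_dpo_example
```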