eric-mitchell: Reference implementation for Direct Preference Optimization (DPO)
Top 16.7% on SourcePulse
This repository provides a reference implementation for Direct Preference Optimization (DPO) and its variants, Conservative DPO and IPO, for training language models from preference data. It is designed for researchers and practitioners looking to fine-tune causal HuggingFace models using preference datasets. The implementation allows for easy integration of custom datasets and models, offering a flexible framework for preference-based alignment.
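As an illustration of the custom-dataset path, the upstream README describes adding a get_<name> function to preference_datasets.py that maps each prompt to its candidate responses, preference pairs, and an SFT target. The sketch below follows that described format; the toy data is invented and the exact signature should be verified against preference_datasets.py:

```python
from typing import Dict, List, Optional, Tuple


def get_mydata(split: str, silent: bool = False,
               cache_dir: Optional[str] = None) -> Dict[str, Dict]:
    """Toy custom preference dataset in the shape the README describes:
    prompt -> {'responses': List[str],
               'pairs': List[Tuple[int, int]],   # (chosen_idx, rejected_idx)
               'sft_target': str}.
    In a real loader, `split` would select train/test data.
    """
    data = {
        'Why is the sky blue?': {
            'responses': [
                'Rayleigh scattering of sunlight by air molecules.',
                'Because it reflects the ocean.',
            ],
            'pairs': [(0, 1)],  # response 0 is preferred over response 1
            'sft_target': 'Rayleigh scattering of sunlight by air molecules.',
        },
    }
    return data
```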
How It Works
Training proceeds in two stages: supervised fine-tuning (SFT) on the target dataset, followed by preference learning on pairs of preferred and dispreferred responses. The core idea is to optimize the language model policy directly with a loss derived from the preference data, bypassing the need for an explicit reward model. This simplifies the training pipeline and can make learning more stable and efficient.
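Concretely, the DPO objective compares the policy's log-probability margin between the chosen and rejected response against the same margin under a frozen reference (SFT) model. The sketch below is an independent reimplementation for clarity, not the repository's trainer code; tensor names and the beta default are assumptions:

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from per-sequence log-probabilities.

    Each input is the summed token log-prob of a response under either
    the policy being trained or the frozen SFT reference model.
    `beta` scales the implicit KL penalty toward the reference.
    """
    # log pi(y_w|x) - log pi(y_l|x), and the same ratio under the reference
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # DPO: -log sigmoid(beta * (policy margin - reference margin))
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
```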
Quick Start & Requirements
- Install dependencies with pip install -r requirements.txt.
- See config/model for model configurations and preference_datasets.py for dataset integration.
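Runs are driven by train.py with Hydra-style overrides. A sketch of the two-stage recipe is below; the flag names (model=, datasets=, loss=, model.archive=, exp_name=) come from the upstream README, while the model name, paths, and batch sizes are placeholders to adjust for your setup:

```sh
# Stage 1: supervised fine-tuning (SFT) on the chosen responses
python -u train.py model=pythia69 datasets=[hh] loss=sft exp_name=sft_example \
    gradient_accumulation_steps=2 batch_size=64 eval_batch_size=32

# Stage 2: DPO, initializing the policy from the SFT checkpoint
# (the model.archive path is a placeholder)
python -u train.py model=pythia69 datasets=[hh] loss=dpo loss.beta=0.1 \
    model.archive=/path/to/sft/checkpoint/policy.pt exp_name=dpo_example \
    gradient_accumulation_steps=2 batch_size=32 eval_batch_size=32
```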
Highlighted Details
- Supports DPO, conservative DPO (via loss.label_smoothing), and IPO (via loss=ipo).
- Three trainer classes: BasicTrainer (naive multi-GPU), FSDPTrainer (PyTorch FSDP), and an experimental TensorParallelTrainer.
- Mixed-precision training (bfloat16, float16) and activation checkpointing for performance optimization with FSDP.
Maintenance & Community
The project is associated with the authors of the DPO paper. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Users should verify licensing for commercial use or integration into closed-source projects.
Limitations & Caveats
Sampling during evaluation can be slow with FSDPTrainer and TensorParallelTrainer, and TensorParallelTrainer is noted as experimental. The README suggests setting ulimit -n 64000 when using FSDPTrainer and recommends sample_during_eval=false for performance.
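In shell form, the README's FSDP performance advice amounts to the following; the train.py override style is taken from the flags above, and the remaining overrides are placeholders:

```sh
# Raise the open-file-descriptor limit before launching FSDPTrainer
ulimit -n 64000

# Disable sampling during evaluation to avoid slow evals
python -u train.py trainer=FSDPTrainer sample_during_eval=false \
    model=pythia69 datasets=[hh] loss=dpo  # plus your other overrides
```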
Last updated: 1 year ago (inactive)
Similar projects: xfactlab, princeton-nlp, XueFuzhao, epfLLM, hiyouga, huggingface