direct-preference-optimization by eric-mitchell

Reference implementation for Direct Preference Optimization (DPO)

Created 2 years ago
2,734 stars

Top 17.4% on SourcePulse

View on GitHub
Project Summary

This repository provides a reference implementation for Direct Preference Optimization (DPO) and its variants, Conservative DPO and IPO, for training language models from preference data. It is designed for researchers and practitioners looking to fine-tune causal HuggingFace models using preference datasets. The implementation allows for easy integration of custom datasets and models, offering a flexible framework for preference-based alignment.

How It Works

The DPO pipeline has two stages: supervised fine-tuning (SFT) on the dataset of interest, followed by preference learning on preference data, ideally drawn from a similar distribution. The core idea is to optimize the language model policy directly with a loss derived from the preference pairs and computed relative to a frozen reference model (typically the SFT checkpoint), bypassing the need for an explicit reward model and a reinforcement learning loop. This simplifies the training pipeline and can make learning more stable and efficient than reward-model-plus-RL approaches.
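
As a rough illustration of that loss, the sketch below computes the DPO objective for a batch of preference pairs from pre-computed per-response log-probabilities. It is a minimal PyTorch sketch, not the repository's trainer code; the function and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps,
             beta: float = 0.1) -> torch.Tensor:
    """Minimal DPO sketch. Each tensor holds the summed log-probability of the
    chosen/rejected response under the trainable policy or the frozen reference
    (SFT) model, one entry per preference pair."""
    # Log-ratio of policy vs. reference for the preferred and dispreferred responses.
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps

    # DPO maximizes the margin between the two implicit rewards via a logistic loss;
    # beta controls how far the policy may drift from the reference model.
    margin = chosen_logratios - rejected_logratios
    return -F.logsigmoid(beta * margin).mean()

# Dummy log-probabilities for a batch of 4 preference pairs.
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
print(loss.item())
```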

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.8+, PyTorch, HuggingFace Transformers. GPU with CUDA is recommended for performance.
  • Setup: Create a virtual environment and install dependencies.
  • Examples: See config/model for model configurations and preference_datasets.py for dataset integration (a hypothetical data sketch follows this list).
  • Docs: https://arxiv.org/abs/2305.18290 (Paper)
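
To make the dataset-integration point concrete, here is a hypothetical example of preference data as prompt/chosen/rejected triples. The field names and structure are illustrative assumptions, not the schema used by preference_datasets.py; check that file for the format the trainers actually expect.

```python
# Hypothetical preference records (illustrative only; see preference_datasets.py
# for the structure this repository actually expects for custom datasets).
custom_preference_data = [
    {
        "prompt": "Explain why the sky appears blue.",
        "chosen": "Sunlight scatters off air molecules, and shorter blue wavelengths scatter the most.",
        "rejected": "Because it reflects the color of the ocean.",
    },
    {
        "prompt": "Summarize the DPO training recipe in one sentence.",
        "chosen": "Run supervised fine-tuning, then optimize the policy directly on preference pairs.",
        "rejected": "Train a reward model and then do a lot of PPO.",
    },
]

# Each record encodes one human judgment: for this prompt, 'chosen' beats 'rejected'.
for record in custom_preference_data:
    assert {"prompt", "chosen", "rejected"} <= record.keys()
```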

Highlighted Details

  • Supports the original DPO loss, Conservative DPO (via loss.label_smoothing), and IPO (via loss=ipo); see the variant sketch after this list.
  • Compatible with any causal HuggingFace model.
  • Offers multiple trainer options: BasicTrainer (naive multi-GPU), FSDPTrainer (PyTorch FSDP), and an experimental TensorParallelTrainer.
  • Enables mixed precision training (bfloat16, float16) and activation checkpointing for performance optimization with FSDP.
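
The loss variants listed above differ only in how the per-pair margin (the chosen-minus-rejected log-ratio from the earlier sketch) is turned into a loss. The sketch below follows the standard formulations from the conservative DPO and IPO write-ups; the function signature and flag names are illustrative, not the repository's internals.

```python
import torch
import torch.nn.functional as F

def preference_loss(margin: torch.Tensor, beta: float = 0.1,
                    label_smoothing: float = 0.0, ipo: bool = False) -> torch.Tensor:
    """`margin` is the per-pair chosen log-ratio minus rejected log-ratio.

    ipo=False, label_smoothing=0.0  -> standard DPO
    ipo=False, label_smoothing=eps  -> conservative DPO (assumes a fraction eps
                                       of preference labels are flipped)
    ipo=True                        -> IPO's squared-error objective
    """
    if ipo:
        # IPO regresses the margin toward 1 / (2 * beta) instead of saturating a sigmoid.
        losses = (margin - 1 / (2 * beta)) ** 2
    else:
        # Label smoothing mixes in the loss for the opposite (flipped) preference.
        losses = (-F.logsigmoid(beta * margin) * (1 - label_smoothing)
                  - F.logsigmoid(-beta * margin) * label_smoothing)
    return losses.mean()

# The same margins scored under the three objectives.
margins = torch.tensor([0.5, -0.2, 1.3])
print(preference_loss(margins))                       # DPO
print(preference_loss(margins, label_smoothing=0.1))  # conservative DPO
print(preference_loss(margins, ipo=True))             # IPO
```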

Maintenance & Community

The project is associated with the authors of the DPO paper. The README does not provide explicit community engagement or contribution details.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

Sampling during evaluation can be slow with FSDPTrainer and TensorParallelTrainer, and TensorParallelTrainer is explicitly marked experimental. The README suggests raising the open-file limit (ulimit -n 64000) when using FSDPTrainer and recommends sample_during_eval=false if evaluation speed is a concern.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 36 stars in the last 30 days
