Reference implementation for Direct Preference Optimization (DPO)
This repository provides a reference implementation for Direct Preference Optimization (DPO) and its variants, Conservative DPO and IPO, for training language models from preference data. It is designed for researchers and practitioners looking to fine-tune causal HuggingFace models using preference datasets. The implementation allows for easy integration of custom datasets and models, offering a flexible framework for preference-based alignment.
How It Works
The DPO pipeline has two stages: supervised fine-tuning (SFT) on the dataset of interest, followed by preference learning on pairwise preference data. The core idea is to optimize the language model policy directly with a loss derived from the preference data, bypassing the need for an explicit reward model. This simplifies the training pipeline and can lead to more stable and efficient learning.
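As a minimal sketch of the objective (function and argument names here are illustrative, not the repository's exact API), the DPO loss compares the policy's preference margin against the frozen reference model's margin on sequence log-probabilities:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    """Illustrative standard DPO objective on per-example sequence log-probs."""
    # Log-ratio of chosen vs. rejected under the policy and the frozen reference.
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    reference_logratio = reference_chosen_logps - reference_rejected_logps
    # Maximize the implicit reward margin of the chosen response over the rejected one.
    return -F.logsigmoid(beta * (policy_logratio - reference_logratio)).mean()
```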
Quick Start & Requirements
pip install -r requirements.txt
See config/model for model configurations and preference_datasets.py for dataset integration.
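To make the quick start concrete, the sketch below assumes the repository's Hydra-style overrides on train.py; treat the model name, dataset, batch sizes, and checkpoint path as placeholders to verify against the config files before running.

```sh
# Stage 1: supervised fine-tuning (SFT). Model, dataset, and batch sizes are illustrative.
python -u train.py model=pythia69 datasets=[hh] loss=sft exp_name=hh_sft_example \
    gradient_accumulation_steps=2 batch_size=64 eval_batch_size=32 \
    trainer=FSDPTrainer sample_during_eval=false

# Stage 2: DPO starting from the SFT checkpoint (checkpoint path is a placeholder).
python -u train.py model=pythia69 datasets=[hh] loss=dpo loss.beta=0.1 \
    model.archive=/path/to/sft/checkpoint/policy.pt exp_name=hh_dpo_example \
    gradient_accumulation_steps=2 batch_size=32 eval_batch_size=32 \
    trainer=FSDPTrainer sample_during_eval=false
```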
Highlighted Details
- Supports standard DPO, conservative DPO (via loss.label_smoothing), and IPO (via loss=ipo); see the sketch after this list for how the variants differ.
- Three trainer classes: BasicTrainer (naive multi-GPU), FSDPTrainer (PyTorch FSDP), and an experimental TensorParallelTrainer.
- Mixed-precision training (bfloat16, float16) and activation checkpointing for performance optimization with FSDP.
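The hedged sketch below (argument names are illustrative; check the repository's loss code for the real interface) shows how label smoothing and the IPO option change the objective applied to the same policy/reference log-ratio margin:

```python
import torch
import torch.nn.functional as F

def preference_loss(policy_logratio, reference_logratio,
                    beta=0.1, label_smoothing=0.0, ipo=False):
    """Illustrative comparison of the DPO, conservative DPO, and IPO variants."""
    margin = policy_logratio - reference_logratio
    if ipo:
        # IPO: regress the margin toward 1 / (2 * beta) instead of using a sigmoid.
        return ((margin - 1 / (2 * beta)) ** 2).mean()
    # Conservative DPO: label_smoothing models the chance that a preference label
    # is flipped; label_smoothing = 0 recovers standard DPO.
    return (-F.logsigmoid(beta * margin) * (1 - label_smoothing)
            - F.logsigmoid(-beta * margin) * label_smoothing).mean()
```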
Maintenance & Community
The project is associated with the authors of the DPO paper. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Users should verify licensing for commercial use or integration into closed-source projects.
Limitations & Caveats
Sampling during evaluation can be slow with FSDPTrainer and TensorParallelTrainer, and TensorParallelTrainer is noted as experimental. The README suggests setting ulimit -n 64000 when using FSDPTrainer and recommends sample_during_eval=false for performance.
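A minimal sketch of applying both recommendations together (the trailing overrides are the illustrative ones from the quick-start sketch above):

```sh
# Raise the open-file limit in the current shell before launching FSDPTrainer.
ulimit -n 64000

# Disable sampling during evaluation to avoid the slow sampling path.
python -u train.py trainer=FSDPTrainer sample_during_eval=false \
    model=pythia69 datasets=[hh] loss=dpo loss.beta=0.1 exp_name=hh_dpo_example
```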