Preference optimization algorithm for LLMs (NeurIPS 2024 paper)
SimPO (Simple Preference Optimization) is a novel preference optimization algorithm for large language models that achieves state-of-the-art performance without relying on a reference model, simplifying the training process and reducing computational overhead. It is designed for researchers and practitioners aiming to enhance LLM alignment and instruction-following capabilities.
How It Works
SimPO introduces a reference-free reward formulation that directly optimizes the policy against pairs of preferred and dispreferred responses. This approach avoids the complexity and potential biases associated with maintaining a separate reference model, leading to a more streamlined and efficient training pipeline. The core innovation is using the length-normalized average log-probability of a response under the policy as the implicit reward, together with a target reward margin between winning and losing responses, enabling direct policy updates from preference data.
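Below is a minimal PyTorch sketch of this objective, following the formulation in the SimPO paper; the function name, argument layout, and default values are illustrative assumptions rather than the repository's actual implementation.

```python
import torch.nn.functional as F

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths,
               beta=2.0, gamma_beta_ratio=0.5):
    """Reference-free SimPO objective (sketch, not the official code).

    policy_*_logps: summed token log-probabilities of each response under
    the current policy; *_lengths: response lengths in tokens.
    Default hyperparameter values are illustrative, not recommendations.
    """
    # Implicit reward: length-normalized average log-probability, scaled by beta.
    chosen_rewards = beta * policy_chosen_logps / chosen_lengths
    rejected_rewards = beta * policy_rejected_logps / rejected_lengths

    # Target reward margin gamma, expressed as a fraction of beta.
    gamma = gamma_beta_ratio * beta

    # Bradley-Terry style pairwise loss with a margin; no reference model is used.
    logits = chosen_rewards - rejected_rewards - gamma
    return -F.logsigmoid(logits).mean()
```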
Quick Start & Requirements
Clone the alignment-handbook repository and install dependencies using pip install . from the repository root. Requires PyTorch v2.2.2 and Flash Attention 2, and uses accelerate for distributed training. Example commands are provided for Mistral and Llama3 models.
Highlighted Details
Training is sensitive to hyperparameter choices, with learning_rate, beta, and gamma_beta_ratio being key.
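For illustration, a hypothetical set of SimPO hyperparameters might be organized as below; the key names mirror those called out above, but the values are placeholders, and the repository's published training configs should be consulted for tuned settings.

```python
# Hypothetical hyperparameter settings for a SimPO training run.
# Values are placeholders, not the repository's recommended settings.
simpo_hparams = {
    "learning_rate": 5e-7,       # preference optimization typically uses a low LR
    "beta": 2.0,                 # scales the length-normalized implicit reward
    "gamma_beta_ratio": 0.5,     # target reward margin, as a fraction of beta
    "num_train_epochs": 1,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 16,
}

# The margin gamma itself is derived from the ratio:
gamma = simpo_hparams["gamma_beta_ratio"] * simpo_hparams["beta"]
```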
Maintenance & Community
Licensing & Compatibility
SimPO builds on the alignment-handbook, which typically uses Apache 2.0. A specific license for SimPO itself is not explicitly stated in the README but is expected to be permissive for research and commercial use.
Limitations & Caveats
Evaluation requires a pinned AlpacaEval version (alpaca-eval==0.6.2) due to recent changes in the evaluation library.