Preference optimization algorithm for LLMs (NeurIPS 2024 paper)
SimPO (Simple Preference Optimization) is a novel preference optimization algorithm for large language models that achieves state-of-the-art performance without relying on a reference model, simplifying the training process and reducing computational overhead. It is designed for researchers and practitioners aiming to enhance LLM alignment and instruction-following capabilities.
How It Works
SimPO introduces a reference-free reward formulation that directly optimizes the policy against pairs of preferred and dispreferred responses. This approach avoids the complexity and potential biases associated with maintaining a separate reference model, leading to a more streamlined and efficient training pipeline. The core innovation is using the length-normalized average log-probability of a response under the policy as the implicit reward, together with a target reward margin between winning and losing responses, enabling direct policy updates from preference data.
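Below is a minimal PyTorch sketch of this objective, following the formulation in the SimPO paper; the function name, argument layout, and default values are illustrative assumptions rather than the repository's actual implementation.

```python
import torch.nn.functional as F

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths,
               beta=2.0, gamma_beta_ratio=0.5):
    """Reference-free SimPO objective (sketch, not the official code).

    policy_*_logps: summed token log-probabilities of each response under
    the current policy; *_lengths: response lengths in tokens.
    Default hyperparameter values are illustrative, not recommendations.
    """
    # Implicit reward: length-normalized average log-probability, scaled by beta.
    chosen_rewards = beta * policy_chosen_logps / chosen_lengths
    rejected_rewards = beta * policy_rejected_logps / rejected_lengths

    # Target reward margin gamma, expressed as a fraction of beta.
    gamma = gamma_beta_ratio * beta

    # Bradley-Terry style pairwise loss with a margin; no reference model is used.
    logits = chosen_rewards - rejected_rewards - gamma
    return -F.logsigmoid(logits).mean()
```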
Quick Start & Requirements
Clone the alignment-handbook repository and install dependencies using pip install . from the repository root. Requires PyTorch v2.2.2 and Flash Attention 2, and uses accelerate for distributed training. Example commands are provided for Mistral and Llama3 models.
Highlighted Details
Training is sensitive to hyperparameter choices, with learning_rate, beta, and gamma_beta_ratio being key.
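For illustration, a hypothetical set of SimPO hyperparameters might be organized as below; the key names mirror those called out above, but the values are placeholders, and the repository's published training configs should be consulted for tuned settings.

```python
# Hypothetical hyperparameter settings for a SimPO training run.
# Values are placeholders, not the repository's recommended settings.
simpo_hparams = {
    "learning_rate": 5e-7,       # preference optimization typically uses a low LR
    "beta": 2.0,                 # scales the length-normalized implicit reward
    "gamma_beta_ratio": 0.5,     # target reward margin, as a fraction of beta
    "num_train_epochs": 1,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 16,
}

# The margin gamma itself is derived from the ratio:
gamma = simpo_hparams["gamma_beta_ratio"] * simpo_hparams["beta"]
```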
Maintenance & Community
Licensing & Compatibility
SimPO builds on the alignment-handbook, which typically uses Apache 2.0. A specific license for SimPO itself is not explicitly stated in the README but is expected to be permissive for research and commercial use.
Limitations & Caveats
Evaluation requires a pinned AlpacaEval version (alpaca-eval==0.6.2) due to recent changes in the evaluation library.