SimPO by princeton-nlp

Preference optimization algorithm for LLMs (NeurIPS 2024 paper)

Created 1 year ago
921 stars

Top 39.6% on SourcePulse

View on GitHub
Project Summary

SimPO (Simple Preference Optimization) is a novel preference optimization algorithm for large language models that achieves state-of-the-art performance without relying on a reference model, simplifying the training process and reducing computational overhead. It is designed for researchers and practitioners aiming to enhance LLM alignment and instruction-following capabilities.

How It Works

SimPO introduces a reference-free reward: the length-normalized (average) log-probability of a response under the policy being trained. Preferred and dispreferred responses are scored with this implicit reward and compared in a Bradley-Terry-style objective that enforces a target reward margin between them. Because no separate reference model has to be kept in memory or queried, the training pipeline is simpler and less costly, and the training reward is directly aligned with how the model is scored at generation time.
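Below is a minimal PyTorch sketch of a SimPO-style loss illustrating the formulation above. It assumes per-token log-probabilities and response masks for the chosen and rejected responses have already been computed; it is an illustration, not the repository's implementation, and the default values for beta and gamma_beta_ratio are placeholders.

```python
import torch
import torch.nn.functional as F

def simpo_loss(
    chosen_logps: torch.Tensor,     # (batch, seq) per-token log-probs of preferred responses
    rejected_logps: torch.Tensor,   # (batch, seq) per-token log-probs of dispreferred responses
    chosen_mask: torch.Tensor,      # (batch, seq) 1.0 on response tokens, 0.0 on prompt/padding
    rejected_mask: torch.Tensor,
    beta: float = 2.0,              # reward scale (placeholder default)
    gamma_beta_ratio: float = 0.5,  # target reward margin expressed as gamma / beta
) -> torch.Tensor:
    # The implicit, reference-free reward is the length-normalized (average)
    # log-probability of the response under the policy being trained.
    chosen_reward = beta * (chosen_logps * chosen_mask).sum(-1) / chosen_mask.sum(-1)
    rejected_reward = beta * (rejected_logps * rejected_mask).sum(-1) / rejected_mask.sum(-1)

    # Bradley-Terry-style loss with a target margin gamma = beta * gamma_beta_ratio.
    margin = beta * gamma_beta_ratio
    return -F.logsigmoid(chosen_reward - rejected_reward - margin).mean()
```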

Quick Start & Requirements

  • Installation: Clone the alignment-handbook repository and install dependencies by running pip install . from the repository root. Requires PyTorch v2.2.2 and Flash Attention 2.
  • Environment: A Python 3.10 Conda environment is recommended.
  • Training: Uses accelerate for distributed training. Example commands provided for Mistral and Llama3 models.
  • Evaluation: Relies on external repositories for AlpacaEval 2, Arena-Hard, and MT-Bench.
  • Links: Released Models, Training Scripts, Changelog
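As a usage example tied to the Released Models link above, here is a minimal sketch of loading a released checkpoint with Hugging Face transformers. The model id below is an assumption based on the model names mentioned in this summary; verify the exact ids on the Released Models page.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub id for the released Gemma-2-9B-it-SimPO checkpoint; check the
# "Released Models" link for the exact, up-to-date names.
model_id = "princeton-nlp/gemma-2-9b-it-SimPO"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Chat-style generation via the tokenizer's chat template.
messages = [{"role": "user", "content": "Summarize SimPO in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```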

Highlighted Details

  • Outperforms DPO and its variants on AlpacaEval 2, MT-Bench, and Arena-Hard benchmarks.
  • Gemma-2-9B-it-SimPO achieves the #1 rank on the AlpacaEval 2 leaderboard with a 72.4 LC win rate.
  • Hyperparameter tuning is critical; learning_rate, beta, and gamma_beta_ratio are the key settings (an illustrative configuration sketch follows this list).
  • Llama3 v0.2 models show improved performance but may struggle with structured output generation due to potential forgetting.
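To make the tuning advice above concrete, here is an illustrative configuration dictionary. The values are placeholders, not the released settings; the training configs in the repository are the authoritative source.

```python
# Illustrative placeholders only; see the repository's training configs for
# the values actually used for each released model.
simpo_hparams = {
    "learning_rate": 1e-6,    # optimizer learning rate; highly model-dependent
    "beta": 2.0,              # reward scale in the SimPO objective
    "gamma_beta_ratio": 0.5,  # target reward margin expressed as gamma / beta
}
```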

Maintenance & Community

  • Most recent updates (Oct 2024) include released training curves.
  • Issues can be reported via GitHub issues; direct contact provided for questions.
  • GitHub Repository

Licensing & Compatibility

  • The codebase is built on alignment-handbook, which is released under the Apache 2.0 license. A license for SimPO itself is not explicitly stated in the README; check the repository's LICENSE file before relying on it for research or commercial use.

Limitations & Caveats

  • Reproducing AlpacaEval 2 results requires a specific evaluator version (alpaca-eval==0.6.2) because of recent changes in the evaluation library.
  • Llama3 v0.2 models may exhibit issues with structured output generation (e.g., JSON) due to a combination of base model characteristics and training hyperparameters.

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 30 days

Explore Similar Projects

Starred by Sebastian Raschka (author of "Build a Large Language Model (From Scratch)"), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 3 more.

direct-preference-optimization by eric-mitchell

3k stars · Top 0.3%
Reference implementation for Direct Preference Optimization (DPO)
Created 2 years ago · Updated 1 year ago
Starred by Tony Lee (author of HELM; Research Engineer at Meta), Lysandre Debut (Chief Open-Source Officer at Hugging Face), and 24 more.

LLaMA-Factory by hiyouga

58k stars · Top 1.1%
Unified fine-tuning tool for 100+ LLMs & VLMs (ACL 2024)
Created 2 years ago · Updated 2 days ago