Self-Play Preference Optimization (SPPO) aligns language models via self-play
SPPO (Self-Play Preference Optimization) is a framework for efficiently aligning large language models (LLMs) using a self-play mechanism and a novel SPPO loss function. It aims to enhance LLM performance without relying on external preference annotations, and reports results on benchmarks such as AlpacaEval 2.0 that surpass DPO-trained baselines and even proprietary models like GPT-4. The target audience is researchers and developers focused on LLM alignment and optimization.
How It Works
SPPO runs a self-play loop: in each iteration the LLM samples multiple responses per prompt, which are then ranked by a separate preference model (PairRM). The resulting win probabilities are used to fine-tune the LLM with the SPPO loss, whose iterative updates are theoretically shown to converge toward the Nash equilibrium of the underlying preference game. The model therefore learns from its own generated outputs, forming a feedback loop for continuous improvement without additional human- or GPT-4-annotated preference data.
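The per-iteration objective can be summarized in a few lines. The snippet below is a minimal sketch, not the repository's training code: the pairwise win probabilities stand in for PairRM outputs, `eta` is the SPPO step-size hyperparameter, and the objective regresses the log-density ratio toward eta * (P(y beats the current policy) - 1/2).

```python
import torch

def sppo_loss(logp_theta, logp_prev, win_prob, eta=1.0):
    """Squared SPPO loss: push log pi_theta(y|x) - log pi_t(y|x)
    toward eta * (P(y beats pi_t | x) - 1/2) for each sampled response."""
    target = eta * (win_prob - 0.5)
    return ((logp_theta - logp_prev) - target).pow(2).mean()

def estimate_win_prob(pairwise_win_prob):
    """Average each response's win probability against the K responses
    sampled from the same (previous-iteration) policy.
    pairwise_win_prob[i, j] ~ P(response i beats response j)."""
    return pairwise_win_prob.mean(dim=1)

# Toy example with K = 4 responses for one prompt:
K = 4
pairwise = torch.rand(K, K)                      # stand-in for PairRM preference probabilities
win_prob = estimate_win_prob(pairwise)           # P(y_i beats pi_t | x), shape (K,)
logp_theta = torch.randn(K, requires_grad=True)  # log-probs under the model being trained
logp_prev = torch.randn(K)                       # log-probs under the frozen previous model
loss = sppo_loss(logp_theta, logp_prev, win_prob)
loss.backward()
```

In the full pipeline this loss is minimized over responses sampled from the previous iteration's model, and the generate-rank-train cycle is repeated for several iterations.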
Quick Start & Requirements
```bash
git clone https://github.com/uclaml/SPPO.git
cd SPPO
pip install -e .
```
You will also need PairRM (installed from the LLM-Blender repo), and potentially Hugging Face Hub write access for dataset pushing.
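As a quick sanity check of the PairRM dependency, the sketch below uses the Blender interface described in the LLM-Blender README (`loadranker` / `rank`); treat the exact signatures and example texts as assumptions that may vary across versions.

```python
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # downloads the PairRM checkpoint

prompts = ["Summarize self-play preference optimization in one sentence."]
candidates = [[
    "The model is trained against its own sampled responses using preference feedback.",
    "It is a database indexing technique.",
]]

# rank() returns, for each prompt, a ranking of its candidates (1 = most preferred).
ranks = blender.rank(prompts, candidates)
print(ranks)
```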
Highlighted Details
The training pipeline is built on the alignment-handbook, and uses vllm for generation and PairRM for ranking, as sketched below.
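To illustrate the generation side, here is a minimal, hypothetical vllm snippet that samples several candidate responses per prompt for PairRM to rank; the model name and sampling settings are assumptions, not the repository's configuration.

```python
from vllm import LLM, SamplingParams

# Assumed base model; the repo's scripts configure their own checkpoints.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

# Sample several candidates per prompt so PairRM has pairs to compare.
params = SamplingParams(n=5, temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain the SPPO training loop briefly."], params)

for candidate in outputs[0].outputs:
    print(candidate.text)
```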
Maintenance & Community
The repository is hosted under the uclaml GitHub organization.

Licensing & Compatibility
The alignment-handbook and LLM-Blender dependencies may have their own licenses, and compatibility for commercial use is not specified.

Limitations & Caveats
The provided scripts push generated datasets to the UCLA-AGI organization on the Hugging Face Hub, requiring write access or modification of the scripts.
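If you adapt the scripts to push generated data to your own Hugging Face namespace instead, the datasets call looks roughly like the following; the column names and repository id are placeholders, not the repo's actual schema.

```python
from datasets import Dataset

# Placeholder rows; the real pipeline stores prompts, sampled responses,
# and PairRM-derived preference information for the next iteration.
ds = Dataset.from_dict({
    "prompt": ["Example prompt"],
    "responses": [["candidate A", "candidate B"]],
})

# Requires `huggingface-cli login` (or HF_TOKEN) with write access to your namespace.
ds.push_to_hub("your-org/sppo-synthetic-data", private=True)
```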