Self-Play Preference Optimization (SPPO) aligns language models via self-play
SPPO (Self-Play Preference Optimization) is a framework for efficiently aligning large language models (LLMs) using a self-play mechanism and a novel SPPO loss function. It aims to enhance LLM performance without relying on external preference annotations, and reports results on benchmarks such as AlpacaEval 2.0 that surpass DPO-trained baselines and even proprietary models like GPT-4. The target audience is researchers and developers focused on LLM alignment and optimization.
How It Works
SPPO runs a self-play loop: in each iteration the LLM samples multiple responses per prompt, which are then ranked by a separate preference model (PairRM). The resulting win probabilities are used to fine-tune the LLM with the SPPO loss, whose iterative updates are theoretically shown to converge toward the Nash equilibrium of the underlying preference game. The model therefore learns from its own generated outputs, forming a feedback loop for continuous improvement without additional human- or GPT-4-annotated preference data.
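The per-iteration objective can be summarized in a few lines. The snippet below is a minimal sketch, not the repository's training code: the pairwise win probabilities stand in for PairRM outputs, `eta` is the SPPO step-size hyperparameter, and the objective regresses the log-density ratio toward eta * (P(y beats the current policy) - 1/2).

```python
import torch

def sppo_loss(logp_theta, logp_prev, win_prob, eta=1.0):
    """Squared SPPO loss: push log pi_theta(y|x) - log pi_t(y|x)
    toward eta * (P(y beats pi_t | x) - 1/2) for each sampled response."""
    target = eta * (win_prob - 0.5)
    return ((logp_theta - logp_prev) - target).pow(2).mean()

def estimate_win_prob(pairwise_win_prob):
    """Average each response's win probability against the K responses
    sampled from the same (previous-iteration) policy.
    pairwise_win_prob[i, j] ~ P(response i beats response j)."""
    return pairwise_win_prob.mean(dim=1)

# Toy example with K = 4 responses for one prompt:
K = 4
pairwise = torch.rand(K, K)                      # stand-in for PairRM preference probabilities
win_prob = estimate_win_prob(pairwise)           # P(y_i beats pi_t | x), shape (K,)
logp_theta = torch.randn(K, requires_grad=True)  # log-probs under the model being trained
logp_prev = torch.randn(K)                       # log-probs under the frozen previous model
loss = sppo_loss(logp_theta, logp_prev, win_prob)
loss.backward()
```

In the full pipeline this loss is minimized over responses sampled from the previous iteration's model, and the generate-rank-train cycle is repeated for several iterations.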
Quick Start & Requirements
```bash
git clone https://github.com/uclaml/SPPO.git
cd SPPO
pip install -e .
```
You will also need PairRM (installed from the LLM-Blender repo), and potentially Hugging Face Hub write access for dataset pushing.
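As a quick sanity check of the PairRM dependency, the sketch below uses the Blender interface described in the LLM-Blender README (`loadranker` / `rank`); treat the exact signatures and example texts as assumptions that may vary across versions.

```python
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # downloads the PairRM checkpoint

prompts = ["Summarize self-play preference optimization in one sentence."]
candidates = [[
    "The model is trained against its own sampled responses using preference feedback.",
    "It is a database indexing technique.",
]]

# rank() returns, for each prompt, a ranking of its candidates (1 = most preferred).
ranks = blender.rank(prompts, candidates)
print(ranks)
```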
Highlighted Details
The training pipeline is built on the alignment-handbook, and uses vllm for generation and PairRM for ranking, as sketched below.
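To illustrate the generation side, here is a minimal, hypothetical vllm snippet that samples several candidate responses per prompt for PairRM to rank; the model name and sampling settings are assumptions, not the repository's configuration.

```python
from vllm import LLM, SamplingParams

# Assumed base model; the repo's scripts configure their own checkpoints.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")

# Sample several candidates per prompt so PairRM has pairs to compare.
params = SamplingParams(n=5, temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain the SPPO training loop briefly."], params)

for candidate in outputs[0].outputs:
    print(candidate.text)
```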
Maintenance & Community
The repository is hosted under the uclaml GitHub organization.

Licensing & Compatibility
The alignment-handbook and LLM-Blender dependencies may have their own licenses, and compatibility for commercial use is not specified.

Limitations & Caveats
The provided scripts push generated datasets to the UCLA-AGI organization on the Hugging Face Hub, requiring write access or modification of the scripts.
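If you adapt the scripts to push generated data to your own Hugging Face namespace instead, the datasets call looks roughly like the following; the column names and repository id are placeholders, not the repo's actual schema.

```python
from datasets import Dataset

# Placeholder rows; the real pipeline stores prompts, sampled responses,
# and PairRM-derived preference information for the next iteration.
ds = Dataset.from_dict({
    "prompt": ["Example prompt"],
    "responses": [["candidate A", "candidate B"]],
})

# Requires `huggingface-cli login` (or HF_TOKEN) with write access to your namespace.
ds.push_to_hub("your-org/sppo-synthetic-data", private=True)
```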