SPPO by uclaml

Self-Play Preference Optimization (SPPO) aligns language models via self-play

created 1 year ago
572 stars

Top 57.2% on sourcepulse

View on GitHub
Project Summary

SPPO (Self-Play Preference Optimization) is a framework for efficiently aligning large language models (LLMs) using a self-play mechanism and a novel SPPO loss function. It aims to enhance LLM performance without preference annotations from humans or stronger external models, outperforming DPO-style baselines and even proprietary models such as GPT-4 on AlpacaEval 2.0. The target audience is researchers and developers focused on LLM alignment and optimization.

How It Works

SPPO runs a self-play loop in which the LLM generates multiple responses per prompt, which are then ranked by an off-the-shelf preference model (PairRM). These rankings are converted into win-probability estimates used to train the LLM with the SPPO loss, whose update is theoretically shown to converge toward the Nash equilibrium of the underlying preference game. The model thus learns from its own generated outputs, forming a feedback loop for continuous improvement without preference annotations from humans or stronger models such as GPT-4.
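
For intuition, the training step can be sketched as a simple regression objective. The snippet below is a minimal PyTorch sketch, not the repository's code: it assumes summed token log-probabilities of each sampled response under the current policy and the frozen previous-iteration policy, plus a PairRM-derived estimate of the response's win probability, and regresses the log-density ratio toward η·(win probability − 1/2), the squared-error form of the SPPO loss described in the paper.

    # Minimal sketch of the SPPO objective (illustrative, not the repo's code).
    # Assumed inputs, all of shape (batch,):
    #   logp_theta - summed token log-probs of each response under the current policy
    #   logp_prev  - the same log-probs under the frozen previous-iteration policy
    #   win_prob   - PairRM-based estimate that the response beats the previous policy
    import torch

    def sppo_loss(logp_theta: torch.Tensor,
                  logp_prev: torch.Tensor,
                  win_prob: torch.Tensor,
                  eta: float) -> torch.Tensor:
        """Regress the log-density ratio toward eta * (win probability - 1/2)."""
        log_ratio = logp_theta - logp_prev         # log pi_theta(y|x) - log pi_t(y|x)
        target = eta * (win_prob - 0.5)            # centered, scaled preference signal
        return ((log_ratio - target) ** 2).mean()  # squared-error SPPO objective

Driving this loss to zero roughly pushes the policy toward the previous policy reweighted by exp(η·(P̂ − 1/2)), the multiplicative-weights step behind the Nash-equilibrium convergence argument.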

Quick Start & Requirements

  • Install: Clone the repository and install dependencies:
    git clone https://github.com/uclaml/SPPO.git
    cd SPPO
    pip install -e .
    
  • Prerequisites: Python 3.10, vllm, PairRM (from the LLM-Blender repo; a minimal ranking sketch follows this list), and potentially Hugging Face Hub write access for pushing generated datasets.
  • Setup: Requires cloning multiple repositories and installing several packages.
  • Resources: Training scripts are provided for Mistral-7B and Llama-3-8B, with multi-GPU support for generation and ranking.
  • Docs: Webpage, Paper, Hugging Face
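
As a rough illustration of the ranking step (the repository's own scripts wrap this differently), PairRM can be loaded through the llm-blender package and asked to rank the candidate responses generated for each prompt. The call signatures below follow LLM-Blender's usage examples and may differ across versions; the prompts and candidates are placeholders.

    # Hedged sketch of ranking sampled responses with PairRM via llm-blender.
    # Not the SPPO pipeline's exact code; names follow LLM-Blender's usage examples.
    import llm_blender

    blender = llm_blender.Blender()
    blender.loadranker("llm-blender/PairRM")  # downloads the PairRM ranking model

    prompts = ["What is the capital of France?"]                 # placeholder prompt
    candidates = [["Paris.", "Lyon.", "The capital is Paris."]]  # K sampled responses per prompt

    # For each prompt, rank() orders its candidates from best to worst;
    # SPPO turns such pairwise comparisons into per-response win probabilities.
    ranks = blender.rank(prompts, candidates, return_scores=False, batch_size=1)
    print(ranks)  # one list of ranks per prompt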

Highlighted Details

  • Achieves competitive or superior performance compared to larger models and proprietary LLMs on AlpacaEval 2.0.
  • Mistral-7B-SPPO (best-of-16) outperforms GPT-4 on AlpacaEval 2.0.
  • Llama-3-8B-SPPO demonstrates significant improvements over the base Llama-3-8B model.
  • Theoretically grounded: the self-play update carries game-theoretic guarantees of convergence toward a Nash equilibrium.

Maintenance & Community

  • Codebase is based on alignment-handbook.
  • Uses vllm for generation and PairRM for ranking.
  • For questions, contact authors via email; for code issues, open a GitHub issue.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. The underlying alignment-handbook and LLM-Blender dependencies may have their own licenses. Compatibility for commercial use is not specified.

Limitations & Caveats

  • The README does not specify a license for the SPPO code itself, which may impact commercial use.
  • Some training scripts attempt to push datasets to the Hugging Face Hub under UCLA-AGI, requiring write access or modification.
  • The pipeline currently supports a fixed number of generated samples per prompt (5).
Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 30 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Philipp Schmid (DevRel at Google DeepMind), and 1 more.

LLM-Blender by yuchenlin
Top 0.4% on sourcepulse · 956 stars
LLM ensembling framework using pairwise ranking and generative fusion
created 2 years ago · updated 9 months ago

Starred by Ross Taylor (Cofounder of General Reasoning; Creator of Papers with Code), Daniel Han (Cofounder of Unsloth), and 4 more.

open-instruct by allenai
Top 0.2% on sourcepulse · 3k stars
Training codebase for instruction-following language models
created 2 years ago · updated 1 day ago