SPPO by uclaml

Self-Play Preference Optimization (SPPO) aligns language models via self-play

Created 1 year ago
580 stars

Top 55.8% on SourcePulse

View on GitHub
Project Summary

SPPO (Self-Play Preference Optimization) is a framework for efficiently aligning large language models (LLMs) using a self-play mechanism and a novel SPPO loss function. It aims to improve LLM performance without relying on external preference annotations, outperforming methods such as DPO and even proprietary models like GPT-4 on benchmarks such as AlpacaEval 2.0. The target audience is researchers and developers working on LLM alignment and optimization.

How It Works

SPPO runs a self-play loop: in each iteration the LLM generates several candidate responses per prompt, and a small off-the-shelf ranking model (PairRM) estimates how likely each response is to beat the others. These win probabilities supervise the LLM through the SPPO loss, which is theoretically shown to converge toward the Nash equilibrium of the underlying two-player preference game. The model therefore learns from its own generated outputs in an iterative feedback loop, without human preference annotations or supervision from stronger proprietary models.
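
The core update can be pictured with a short PyTorch sketch of the squared SPPO objective described in the paper. Everything below is illustrative: the function name sppo_loss, the eta value, and the toy tensors are assumptions, not code from this repository.

    import torch

    def sppo_loss(logp_theta, logp_ref, win_prob, eta):
        # Squared-loss form of the SPPO objective: push the log-ratio
        # log(pi_theta(y|x) / pi_t(y|x)) toward eta * (P(y beats pi_t | x) - 1/2).
        # logp_theta: summed log-prob of each response under the model being trained
        # logp_ref:   summed log-prob under the frozen previous-iteration policy pi_t
        # win_prob:   estimated probability (e.g. from PairRM) that the response wins
        log_ratio = logp_theta - logp_ref
        target = eta * (win_prob - 0.5)
        return (log_ratio - target).pow(2).mean()

    # Toy call with made-up numbers; the real eta is a hyperparameter set in the repo's configs.
    logp_theta = torch.tensor([-12.3, -15.1], requires_grad=True)
    logp_ref = torch.tensor([-12.0, -14.8])
    win_prob = torch.tensor([0.7, 0.2])
    loss = sppo_loss(logp_theta, logp_ref, win_prob, eta=1.0)
    loss.backward()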

Quick Start & Requirements

  • Install: Clone the repository and install dependencies:
    git clone https://github.com/uclaml/SPPO.git
    cd SPPO
    pip install -e .
    
  • Prerequisites: Python 3.10, vllm, and PairRM (installed from the LLM-Blender repo); some scripts also need Hugging Face Hub write access for dataset pushing. A minimal PairRM ranking sketch follows this list.
  • Setup: involves cloning multiple repositories (this one and LLM-Blender for PairRM) and installing several packages.
  • Resources: Training scripts are provided for Mistral-7B and Llama-3-8B, with multi-GPU support for generation and ranking.
  • Docs: Webpage, Paper, Huggingface
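
Because ranking is delegated to PairRM, the snippet below sketches how a batch of candidate responses could be ranked with the llm-blender package, following the API documented in the LLM-Blender README; the prompt and candidate strings are placeholders, and this is not the repository's own ranking script.

    import llm_blender

    blender = llm_blender.Blender()
    blender.loadranker("llm-blender/PairRM")  # downloads the PairRM ranker checkpoint

    prompts = ["Explain self-play preference optimization in one sentence."]
    # The SPPO pipeline fixes the number of generated samples per prompt at 5.
    candidates = [[f"candidate answer {i}" for i in range(5)]]

    ranks = blender.rank(prompts, candidates, return_scores=False, batch_size=1)
    print(ranks)  # rank 1 = most preferred candidate for each prompt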

Highlighted Details

  • Achieves competitive or superior performance compared to larger models and proprietary LLMs on AlpacaEval 2.0.
  • Mistral-7B-SPPO (best-of-16) outperforms GPT-4 on AlpacaEval 2.0.
  • Llama-3-8B-SPPO demonstrates significant improvements over the base Llama-3-8B model.
  • The SPPO loss is grounded in game theory, with theoretical convergence guarantees toward a Nash equilibrium.

Maintenance & Community

  • Codebase is based on alignment-handbook.
  • Uses vllm for generation and PairRM for ranking.
  • For questions, contact authors via email; for code issues, open a GitHub issue.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. The underlying alignment-handbook and LLM-Blender dependencies may have their own licenses. Compatibility for commercial use is not specified.

Limitations & Caveats

  • The README does not specify a license for the SPPO code itself, which may impact commercial use.
  • Some training scripts attempt to push datasets to the Hugging Face Hub under UCLA-AGI, requiring write access or modification.
  • The pipeline currently supports a fixed number of generated samples per prompt (5).
Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 30 days
