SPPO by uclaml

Self-Play Preference Optimization (SPPO) aligns language models via self-play

Created 1 year ago
581 stars

Top 55.7% on SourcePulse

View on GitHub
Project Summary

SPPO (Self-Play Preference Optimization) is a framework for efficiently aligning large language models (LLMs) using a self-play mechanism and a novel SPPO loss function. It aims to improve alignment without additional human- or GPT-4-annotated preference data, outperforming DPO-style baselines and, on AlpacaEval 2.0, even proprietary models such as GPT-4. The target audience is researchers and developers working on LLM alignment and optimization.

How It Works

SPPO runs a self-play loop: the LLM samples multiple responses per prompt, a small off-the-shelf ranking model (PairRM) ranks them, and the resulting rankings are converted into win-rate estimates used to train the LLM with the SPPO loss, which is theoretically guaranteed to converge toward a Nash equilibrium of the underlying preference game. The model therefore learns from its own generated outputs, forming a feedback loop for iterative improvement without additional human or GPT-4 preference annotations.
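Concretely, the loss regresses the policy's log-probability ratio against the previous iteration's model toward a scaled, centered win probability. Below is a minimal PyTorch-style sketch of that per-example squared loss, assuming the formulation described in the paper; the function and argument names are illustrative and not taken from the repo:

    import torch

    def sppo_loss(policy_logps: torch.Tensor, ref_logps: torch.Tensor,
                  win_probs: torch.Tensor, eta: float) -> torch.Tensor:
        # policy_logps: log pi_theta(y|x), summed over response tokens, shape (B,)
        # ref_logps:    log pi_t(y|x) under the frozen previous-iteration model, shape (B,)
        # win_probs:    estimated P(y beats pi_t | x), e.g. derived from PairRM rankings, shape (B,)
        # eta:          step-size hyperparameter of the multiplicative-weights update
        log_ratio = policy_logps - ref_logps
        target = eta * (win_probs - 0.5)  # push the log-ratio toward eta * (P_hat - 1/2)
        return ((log_ratio - target) ** 2).mean()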

Quick Start & Requirements

  • Install: Clone the repository and install dependencies:
    git clone https://github.com/uclaml/SPPO.git
    cd SPPO
    pip install -e .
    
  • Prerequisites: Python 3.10, vllm, PairRM (from the LLM-Blender repo; see the ranking sketch after this list), and potentially Hugging Face Hub write access for dataset pushing.
  • Setup: Requires cloning the SPPO and LLM-Blender repositories and installing their dependencies.
  • Resources: Training scripts are provided for Mistral-7B and Llama-3-8B, with multi-GPU support for generation and ranking.
  • Docs: Webpage, Paper, Hugging Face
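
As a concrete example of the PairRM ranking step referenced in the prerequisites, here is a minimal sketch of scoring candidate responses with the llm-blender package; it follows the usage shown in the LLM-Blender README, the prompt and candidates are illustrative, and SPPO's own scripts wrap this step differently:

    import llm_blender

    # Load the ~0.4B PairRM ranker that supplies the preference signal.
    blender = llm_blender.Blender()
    blender.loadranker("llm-blender/PairRM")

    # One prompt with several candidate responses (illustrative data).
    inputs = ["Explain self-play preference optimization in one sentence."]
    candidates = [[
        "SPPO fine-tunes a model on preferences over its own sampled responses.",
        "It is a kind of fine-tuning.",
        "Self-play means the model plays chess against itself.",
    ]]

    # rank() returns a ranking of the candidates for each prompt; in the SPPO
    # pipeline such rankings are turned into the win-rate estimates used by the loss.
    ranks = blender.rank(inputs, candidates)
    print(ranks)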

Highlighted Details

  • Achieves competitive or superior performance compared to larger models and proprietary LLMs on AlpacaEval 2.0.
  • Mistral-7B-SPPO (best-of-16) outperforms GPT-4 on AlpacaEval 2.0.
  • Llama-3-8B-SPPO demonstrates significant improvements over the base Llama-3-8B model.
  • Theoretically grounded in game theory for convergence guarantees.

Maintenance & Community

  • Codebase is based on alignment-handbook.
  • Uses vllm for generation and PairRM for ranking.
  • For questions, contact authors via email; for code issues, open a GitHub issue.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. The underlying alignment-handbook and LLM-Blender dependencies may have their own licenses. Compatibility for commercial use is not specified.

Limitations & Caveats

  • The README does not specify a license for the SPPO code itself, which may impact commercial use.
  • Some training scripts attempt to push datasets to the Hugging Face Hub under UCLA-AGI, requiring write access or modification.
  • The pipeline currently supports a fixed number of generated samples per prompt (5).
Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Vincent Weisser (cofounder of Prime Intellect).

oat by sail-sg

LLM online alignment framework for research
554 stars
Created 1 year ago
Updated 4 days ago
Starred by Vincent Weisser (cofounder of Prime Intellect), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 7 more.

self-rewarding-lm-pytorch by lucidrains

Training framework for self-rewarding language models
1k stars
Created 1 year ago
Updated 1 year ago