Preference optimization without a reference model
ORPO (Odds Ratio Preference Optimization), introduced in the paper "ORPO: Monolithic Preference Optimization without Reference Model", is a method for aligning large language models (LLMs) with human preferences, offering an alternative to existing techniques like RLHF. It targets LLM researchers and developers who want to improve instruction following and preference alignment in their models.
How It Works
ORPO directly optimizes the LLM's policy with a single loss that increases the likelihood of preferred responses and penalizes dispreferred ones via an odds-ratio term added to the standard supervised fine-tuning objective. Because it needs neither a separately trained reward model nor a frozen reference model, it avoids the complexity and instability those components introduce and simplifies the alignment pipeline.
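The sketch below illustrates the idea of the objective: the SFT negative log-likelihood on the preferred response plus a weighted odds-ratio term contrasting preferred and dispreferred responses. Tensor names, shapes, and the default beta are assumptions for illustration, not the repository's implementation.

```python
# Illustrative sketch of the ORPO objective; names and shapes are assumptions.
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              chosen_nll: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    # chosen_logps / rejected_logps: mean per-token log-probabilities of the
    # preferred and dispreferred responses under the policy being trained.
    # chosen_nll: supervised (SFT) negative log-likelihood on the preferred response.
    # log odds(y|x) = log p - log(1 - p), computed in log space for stability.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # -log sigmoid(log odds ratio): small when the preferred response is
    # much more likely than the dispreferred one.
    odds_ratio_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return (chosen_nll + beta * odds_ratio_term).mean()
```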
Quick Start & Requirements
A demo of the ORPOTrainer is in trl/test_orpo_trainer_demo.py.
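For orientation, a minimal ORPOTrainer run with TRL might look like the sketch below. The base model, dataset, and hyperparameters are placeholders, and exact argument names can vary across TRL versions; this is not the repository's demo script.

```python
# Hypothetical minimal ORPOTrainer run; model/dataset names and hyperparameters
# are placeholders, and argument names may differ by TRL version.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Expects a preference dataset with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned",
                       split="train")

config = ORPOConfig(
    output_dir="./orpo-out",
    beta=0.1,                        # weight on the odds-ratio term
    per_device_train_batch_size=4,
    learning_rate=8e-6,
    num_train_epochs=1,
)
trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,             # newer TRL versions use processing_class
)
trainer.train()
```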
Highlighted Details
Released model checkpoints include kaist-ai/mistral-orpo-capybara-7k, kaist-ai/mistral-orpo-alpha, and kaist-ai/mistral-orpo-beta.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats