orpo by xfactlab

Preference optimization without a reference model

Created 1 year ago
464 stars

Top 65.4% on SourcePulse

Project Summary

ORPO (Odds Ratio Preference Optimization), introduced in the paper "ORPO: Monolithic Preference Optimization without Reference Model", is a method for aligning large language models (LLMs) with human preferences without a separate reward or reference model, offering an alternative to techniques such as RLHF. It targets LLM researchers and developers seeking to improve model instruction following and preference alignment.

How It Works

ORPO augments the standard supervised fine-tuning (negative log-likelihood) loss on the preferred response with an odds-ratio penalty that pushes the policy's odds of generating the chosen response above its odds of generating the rejected one. Because this contrast is computed from the policy being trained, no separate reward model or frozen reference model is needed, which simplifies the alignment pipeline and avoids the cost and instability of multi-stage RLHF-style training.
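
For intuition, here is a minimal PyTorch sketch of the ORPO objective as described in the paper, not code from this repository; `chosen_logps`/`rejected_logps` are assumed to be length-normalized sequence log-probabilities under the policy, and `beta` weights the odds-ratio term (λ in the paper):

```python
# Minimal sketch of the ORPO loss, not the repository's implementation.
# chosen_logps / rejected_logps: length-normalized log P(y|x) under the
# policy for the chosen and rejected responses; nll_loss: the usual SFT
# loss on the chosen responses; beta: weight of the odds-ratio term.
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, nll_loss, beta=0.1):
    # log-odds: log p - log(1 - p), with log(1 - p) computed stably
    # as log1p(-exp(log p))
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio term: reward higher odds for the chosen response
    log_odds_ratio = F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return nll_loss - beta * log_odds_ratio.mean()
```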

Quick Start & Requirements

  • Install: Integration with 🤗 TRL, Axolotl, and LLaMA-Factory is available; a sample script for ORPOTrainer is in trl/test_orpo_trainer_demo.py (see the usage sketch after this list).
  • Prerequisites: Requires Python and Hugging Face libraries. Specific hardware requirements (e.g., GPU, VRAM) depend on the model size and training configuration.
  • Resources: Links to Wandb reports for model checkpoints are provided.
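
A hedged sketch of the TRL path, assuming a preference dataset with prompt/chosen/rejected columns; the dataset name here is illustrative, and keyword names (e.g. `tokenizer=` vs. `processing_class=`) vary across TRL versions:

```python
# Hedged sketch of preference-tuning with TRL's ORPOTrainer.
# Dataset choice is illustrative; ORPO expects prompt/chosen/rejected data.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized",
                             split="train_prefs")

args = ORPOConfig(beta=0.1, output_dir="mistral-orpo")  # beta is lambda in the paper
trainer = ORPOTrainer(model=model, args=args,
                      train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()
```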

Highlighted Details

  • Mistral-ORPO-β achieved a 14.7% length-controlled win rate on the AlpacaEval Leaderboard.
  • Provides pre-trained model checkpoints such as kaist-ai/mistral-orpo-capybara-7k, kaist-ai/mistral-orpo-alpha, and kaist-ai/mistral-orpo-beta (a loading sketch follows this list).
  • Includes performance results on AlpacaEval, MT-Bench, and IFEval benchmarks.
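
A sketch of loading one of the released checkpoints for inference with 🤗 Transformers; the prompt and generation settings are illustrative, and the model card's chat template governs the actual formatting:

```python
# Hedged sketch of inference with a released ORPO checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kaist-ai/mistral-orpo-beta"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user",
             "content": "Explain preference optimization in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```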

Maintenance & Community

  • Official repository for the ORPO paper.
  • README update notes document the TRL, Axolotl, and LLaMA-Factory integrations, though the Health Check below shows no recent commit activity.

Licensing & Compatibility

  • The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Detailed training logs for Mistral-ORPO-Capybara-7k are marked as "TBU" (To Be Updated).
  • The project appears to be in active development, with some components potentially subject to change.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 5 stars in the last 30 days

Explore Similar Projects

Starred by Sebastian Raschka (author of "Build a Large Language Model (From Scratch)"), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 3 more.

direct-preference-optimization by eric-mitchell

Reference implementation for Direct Preference Optimization (DPO)
3k stars · Top 0.3% on SourcePulse
Created 2 years ago · Updated 1 year ago
Starred by Tony Lee (author of HELM; Research Engineer at Meta), Lysandre Debut (Chief Open-Source Officer at Hugging Face), and 24 more.

LLaMA-Factory by hiyouga

Unified fine-tuning tool for 100+ LLMs & VLMs (ACL 2024)
58k stars · Top 1.1% on SourcePulse
Created 2 years ago · Updated 2 days ago