orpo by xfactlab

Preference optimization without a reference model

Created 1 year ago
463 stars

Top 65.4% on SourcePulse

View on GitHub
Project Summary

ORPO (Odds Ratio Preference Optimization), introduced in the paper "ORPO: Monolithic Preference Optimization without Reference Model", is a method for aligning large language models (LLMs) with human preferences without a separate reward or reference model, offering a simpler alternative to techniques like RLHF. It targets LLM researchers and developers seeking to improve model instruction following and preference alignment.

How It Works

ORPO optimizes the LLM's policy directly by augmenting the standard supervised fine-tuning (SFT) loss with an odds-ratio penalty that raises the likelihood of preferred responses while suppressing the likelihood of dispreferred ones. Because the penalty is computed from the policy itself, ORPO needs neither a separate reward model nor a frozen reference model, avoiding the complexity and instability those extra components introduce and simplifying the alignment pipeline.
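
To make the objective concrete, below is a minimal PyTorch sketch of the loss as described in the ORPO paper. The function name orpo_loss, the lam weight, and the assumption of length-normalized log-probability inputs are illustrative, not the repository's actual API.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    """Sketch of the ORPO objective: L_SFT + lam * L_OR.

    chosen_logps / rejected_logps are length-normalized log-probabilities
    log P(y|x) of the preferred and dispreferred responses under the
    current policy; no reference model is involved.
    """
    # odds(y|x) = P(y|x) / (1 - P(y|x)), computed in log space for stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # L_OR = -log sigmoid(log-odds ratio): widens the likelihood gap
    # between the chosen and rejected responses
    l_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # L_SFT: ordinary negative log-likelihood on the chosen response
    l_sft = -chosen_logps
    return (l_sft + lam * l_or).mean()

# Toy usage: the chosen response is already more likely, so the penalty is small
loss = orpo_loss(torch.tensor([-0.5]), torch.tensor([-1.5]))
```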

Quick Start & Requirements

  • Install: Integration with 🤗 TRL, Axolotl, and LLaMA-Factory is available; a sample ORPOTrainer script lives in trl/test_orpo_trainer_demo.py (see the TRL sketch after this list).
  • Prerequisites: Requires Python and Hugging Face libraries. Specific hardware requirements (e.g., GPU, VRAM) depend on the model size and training configuration.
  • Resources: Links to Wandb reports for model checkpoints are provided.
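
As referenced above, here is a minimal training sketch using 🤗 TRL's ORPOTrainer. The model name and one-example inline dataset are placeholders; ORPOConfig's beta corresponds to the odds-ratio weight (λ in the paper), and in older TRL releases the processing_class argument is named tokenizer.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# ORPOTrainer expects preference pairs: prompt / chosen / rejected
train_dataset = Dataset.from_dict({
    "prompt": ["What is ORPO?"],
    "chosen": ["ORPO aligns a model without a reference model."],
    "rejected": ["I don't know."],
})

config = ORPOConfig(
    output_dir="orpo-out",
    beta=0.1,  # weight on the odds-ratio term
    per_device_train_batch_size=1,
    max_steps=10,
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```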

Highlighted Details

  • Mistral-ORPO-β achieved a 14.7% length-controlled win rate on the AlpacaEval 2.0 leaderboard.
  • Provides pre-trained model checkpoints such as kaist-ai/mistral-orpo-capybara-7k, kaist-ai/mistral-orpo-alpha, and kaist-ai/mistral-orpo-beta (loading example after this list).
  • Includes performance results on AlpacaEval, MT-Bench, and IFEval benchmarks.
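
The released checkpoints load with plain 🤗 Transformers; the snippet below is a sketch (the prompt text and generation settings are arbitrary, and a GPU with enough memory for a 7B model is assumed).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kaist-ai/mistral-orpo-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# The Mistral-ORPO checkpoints ship with a chat template
messages = [{"role": "user", "content": "Explain ORPO in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```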

Maintenance & Community

  • Official repository for ORPO.
  • README update notes document integrations with 🤗 TRL, Axolotl, and LLaMA-Factory, though commit activity has since gone quiet (see Health Check below).

Licensing & Compatibility

  • The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • Detailed training logs for Mistral-ORPO-Capybara-7k are marked as "TBU" (To Be Updated).
  • The repository has seen no commits in about a year, so incomplete items (such as the training logs above) may remain unfinished.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Starred by Sebastian Raschka (Author of "Build a Large Language Model (From Scratch)"), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 3 more.

Explore Similar Projects

direct-preference-optimization by eric-mitchell

Top 0.1% on SourcePulse · 3k stars
Reference implementation for Direct Preference Optimization (DPO)
Created 2 years ago · Updated 1 year ago
Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Alex Chen (Cofounder of Nexa AI), and 28 more.

LLaMA-Factory by hiyouga

Top 1.3% on SourcePulse · 62k stars
Unified fine-tuning tool for 100+ LLMs & VLMs (ACL 2024)
Created 2 years ago · Updated 10 hours ago