Preference optimization without a reference model
ORPO (Odds Ratio Preference Optimization), introduced in the paper "ORPO: Monolithic Preference Optimization without Reference Model", is a method for aligning large language models (LLMs) with human preferences, offering an alternative to existing techniques like RLHF. It targets LLM researchers and developers who want to improve instruction following and preference alignment in their models.
How It Works
ORPO directly optimizes the LLM's policy with a single loss that increases the likelihood of preferred responses and penalizes dispreferred ones via an odds-ratio term added to the standard supervised fine-tuning objective. Because it needs neither a separately trained reward model nor a frozen reference model, it avoids the complexity and instability those components introduce and simplifies the alignment pipeline.
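The sketch below illustrates the idea of the objective: the SFT negative log-likelihood on the preferred response plus a weighted odds-ratio term contrasting preferred and dispreferred responses. Tensor names, shapes, and the default beta are assumptions for illustration, not the repository's implementation.

```python
# Illustrative sketch of the ORPO objective; names and shapes are assumptions.
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              chosen_nll: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    # chosen_logps / rejected_logps: mean per-token log-probabilities of the
    # preferred and dispreferred responses under the policy being trained.
    # chosen_nll: supervised (SFT) negative log-likelihood on the preferred response.
    # log odds(y|x) = log p - log(1 - p), computed in log space for stability.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # -log sigmoid(log odds ratio): small when the preferred response is
    # much more likely than the dispreferred one.
    odds_ratio_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return (chosen_nll + beta * odds_ratio_term).mean()
```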
Quick Start & Requirements
A demo of the ORPOTrainer is in trl/test_orpo_trainer_demo.py.
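For orientation, a minimal ORPOTrainer run with TRL might look like the sketch below. The base model, dataset, and hyperparameters are placeholders, and exact argument names can vary across TRL versions; this is not the repository's demo script.

```python
# Hypothetical minimal ORPOTrainer run; model/dataset names and hyperparameters
# are placeholders, and argument names may differ by TRL version.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Expects a preference dataset with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned",
                       split="train")

config = ORPOConfig(
    output_dir="./orpo-out",
    beta=0.1,                        # weight on the odds-ratio term
    per_device_train_batch_size=4,
    learning_rate=8e-6,
    num_train_epochs=1,
)
trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,             # newer TRL versions use processing_class
)
trainer.train()
```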
Highlighted Details
Released model checkpoints include kaist-ai/mistral-orpo-capybara-7k, kaist-ai/mistral-orpo-alpha, and kaist-ai/mistral-orpo-beta.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats