LLM tuning via RLHF (SFT+RM+PPO+DPO) with LoRA
Top 69.9% on sourcepulse
This project provides a comprehensive, from-scratch implementation of the three-stage Reinforcement Learning from Human Feedback (RLHF) training process, comprising Supervised Fine-Tuning (SFT), Reward Modeling (RM), and Proximal Policy Optimization (PPO), along with Direct Preference Optimization (DPO) as an alternative alignment method. It targets researchers and practitioners looking to fine-tune Large Language Models (LLMs) with advanced RL techniques, offering detailed implementation insights and flexible training configurations.
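For orientation, the DPO stage optimizes a preference objective directly on the policy, without a separate reward model or PPO rollouts. The snippet below is a minimal sketch of the standard DPO loss, not code taken from this repository; the function name and the `beta` default are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: prefer the chosen response over the rejected
    one by a margin measured against a frozen reference (SFT) model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```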
How It Works
The framework leverages the PEFT library, specifically LoRA adapters, to enable efficient fine-tuning. It supports loading multiple models (base model, SFT, RM, Actor, Critic) simultaneously, allowing for complex training setups. The implementation emphasizes flexibility, offering various configurations for model loading and for distributed training via accelerate and deepspeed, including scenarios with single or multiple LoRA adapters and shared base models.
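As a rough illustration of the shared-base-model scenario (model name and adapter paths below are placeholders, not taken from this project), peft allows several LoRA adapters to be attached to one frozen backbone and switched at runtime:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Placeholder base model and adapter paths -- substitute your own.
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

# Attach the SFT adapter, then load further adapters on the same base
# weights so Actor / Reward / Critic roles can share one frozen backbone.
model = PeftModel.from_pretrained(base_model, "path/to/sft_adapter", adapter_name="sft")
model.load_adapter("path/to/reward_adapter", adapter_name="reward")

model.set_adapter("sft")      # run generation with the policy (actor) adapter
# model.set_adapter("reward") # switch to reward scoring without reloading the base
```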
Quick Start & Requirements
pip install -r requirements.txt
Pinned dependencies: accelerate==0.21.0, datasets==2.13.1, scikit-learn==1.3.0, sentencepiece==0.1.99, tqdm==4.65.0, transformers==4.31.0, wandb==0.15.8, peft==0.4.0, trl==0.5.0, deepspeed==0.10.0.
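With these pinned versions, a PPO update in trl follows the usual pattern shown below. This is a generic sketch using trl's public API, not the project's own training script; the model name, hyperparameters, and rollout data are placeholders.

```python
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)      # actor + value head
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen reference

config = PPOConfig(model_name=model_name, learning_rate=1e-5,
                   batch_size=8, mini_batch_size=2)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One PPO step takes lists of query/response token-id tensors plus scalar
# rewards produced by the reward model during rollout (elided here):
# stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```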
Highlighted Details
Efficient LoRA-based fine-tuning with distributed training via accelerate and deepspeed.
Maintenance & Community
The project is actively maintained, with recent updates including LLaMA2 and DPO support. A WeChat group is available for discussion.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is under active development, with several features still listed as TODO: improving PPO training stability; implementing PPO-max, DDPO, RRHF, and RAFT; and supporting additional models such as BLOOM and Baichuan. QLoRA training is also a future consideration.