LLM tuning via RLHF (SFT+RM+PPO+DPO) with LoRA
Top 69.9% on sourcepulse
This project provides a comprehensive, from-scratch implementation of the three-stage Reinforcement Learning from Human Feedback (RLHF) training process, comprising Supervised Fine-Tuning (SFT), Reward Modeling (RM), and Proximal Policy Optimization (PPO), along with Direct Preference Optimization (DPO) as an alternative alignment method. It targets researchers and practitioners looking to fine-tune Large Language Models (LLMs) with advanced RL techniques, offering detailed implementation insights and flexible training configurations.
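For orientation, the DPO stage optimizes a preference objective directly on the policy, without a separate reward model or PPO rollouts. The snippet below is a minimal sketch of the standard DPO loss, not code taken from this repository; the function name and the `beta` default are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: prefer the chosen response over the rejected
    one by a margin measured against a frozen reference (SFT) model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```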
How It Works
The framework leverages the PEFT library, specifically LoRA adapters, to enable efficient fine-tuning. It supports loading multiple models (base model, SFT, RM, Actor, Critic) simultaneously, allowing for complex training setups. The implementation emphasizes flexibility, offering various configurations for model loading and for distributed training via accelerate and deepspeed, including scenarios with single or multiple LoRA adapters and shared base models.
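As a rough illustration of the shared-base-model scenario (model name and adapter paths below are placeholders, not taken from this project), peft allows several LoRA adapters to be attached to one frozen backbone and switched at runtime:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Placeholder base model and adapter paths -- substitute your own.
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

# Attach the SFT adapter, then load further adapters on the same base
# weights so Actor / Reward / Critic roles can share one frozen backbone.
model = PeftModel.from_pretrained(base_model, "path/to/sft_adapter", adapter_name="sft")
model.load_adapter("path/to/reward_adapter", adapter_name="reward")

model.set_adapter("sft")      # run generation with the policy (actor) adapter
# model.set_adapter("reward") # switch to reward scoring without reloading the base
```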
Quick Start & Requirements
pip install -r requirements.txt
Pinned dependencies: accelerate==0.21.0, datasets==2.13.1, scikit-learn==1.3.0, sentencepiece==0.1.99, tqdm==4.65.0, transformers==4.31.0, wandb==0.15.8, peft==0.4.0, trl==0.5.0, deepspeed==0.10.0.
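With these pinned versions, a PPO update in trl follows the usual pattern shown below. This is a generic sketch using trl's public API, not the project's own training script; the model name, hyperparameters, and rollout data are placeholders.

```python
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)      # actor + value head
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen reference

config = PPOConfig(model_name=model_name, learning_rate=1e-5,
                   batch_size=8, mini_batch_size=2)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One PPO step takes lists of query/response token-id tensors plus scalar
# rewards produced by the reward model during rollout (elided here):
# stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```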
Highlighted Details
Efficient LoRA-based fine-tuning with distributed training via accelerate and deepspeed.
Maintenance & Community
The project is actively maintained, with recent updates including LLaMA2 and DPO support. A WeChat group is available for discussion.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is under active development, with several features still listed as TODO: improving PPO training stability; implementing PPO-max, DDPO, RRHF, and RAFT; and supporting additional models such as BLOOM and Baichuan. QLoRA training is also a future consideration.