LLM-RLHF-Tuning by Joyce94

LLM tuning via RLHF (SFT+RM+PPO+DPO) with LoRA

created 2 years ago
432 stars

Top 69.9% on sourcepulse

View on GitHub
Project Summary

This project provides a comprehensive, from-scratch implementation of the three-stage Reinforcement Learning from Human Feedback (RLHF) training process (Supervised Fine-Tuning (SFT), Reward Modeling (RM), and Proximal Policy Optimization (PPO)), along with Direct Preference Optimization (DPO) as an alternative alignment method. It targets researchers and practitioners who want to fine-tune Large Language Models (LLMs) with these techniques, offering detailed implementation notes and flexible training configurations.

How It Works

The framework leverages the PEFT library, specifically LoRA adapters, to enable efficient fine-tuning. It supports loading multiple models (base model, SFT, RM, Actor, Critic) simultaneously, allowing for complex training setups. The implementation emphasizes flexibility, offering various configurations for model loading and distributed training via accelerate and deepspeed, including scenarios with single or multiple LoRA adapters and shared base models.
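
For illustration, here is a minimal sketch of attaching a LoRA adapter to a LLaMA-style base model with the pinned PEFT version; the checkpoint name, rank, and target modules are assumptions, not the project's exact configuration:

```python
# Minimal sketch: attach a LoRA adapter to a LLaMA-style causal LM for the SFT stage.
# Hyperparameters and the base checkpoint are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model_name = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float16)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # adapter rank (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # typical LLaMA attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```

The same mechanism extends to the later stages: because only small adapter weights are trained, several roles (e.g., Actor and Critic) can share one frozen base model with separate LoRA adapters, which is what makes the multi-model PPO setup tractable on limited hardware.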

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.10+, PyTorch 2.0.1+, accelerate==0.21.0, datasets==2.13.1, scikit-learn==1.3.0, sentencepiece==0.1.99, tqdm==4.65.0, transformers==4.31.0, wandb==0.15.8, peft==0.4.0, trl==0.5.0, deepspeed==0.10.0.
  • Supported Models: LLaMA, LLaMA2.
  • Documentation: Training guides for SFT, RM, PPO, and DPO are available.

Highlighted Details

  • Full RLHF pipeline: SFT, RM, PPO, and DPO.
  • Supports LLaMA2 and DPO training (see the DPO loss sketch after this list).
  • Flexible distributed training with accelerate and deepspeed.
  • Efficient training via LoRA adapters.
  • Detailed implementation notes for PPO.
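
As a rough illustration of the DPO objective (not necessarily this repository's exact implementation), the loss over a batch of preference pairs can be computed from per-sequence log-probabilities of the policy and the frozen reference (SFT) model; the function and tensor names below are illustrative:

```python
# Sketch of the DPO loss, assuming per-sequence log-probabilities
# have already been computed for the policy and the reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss for a batch of preference pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```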

Maintenance & Community

Recent updates added LLaMA2 and DPO support, though per the health check below the last commit was about a year ago. A WeChat group is available for discussion.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Several features remain listed as TODO: improving PPO training stability; implementing PPO-max, DDPO, RRHF, and RAFT; and supporting additional models such as BLOOM and Baichuan. QLoRA training is also a future consideration.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 22 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

HALOs by ContextualAI (873 stars, top 0.3%)
Library for aligning LLMs using human-aware loss functions
Created 1 year ago, updated 2 weeks ago

Starred by Ross Taylor (Cofounder of General Reasoning; Creator of Papers with Code), Daniel Han (Cofounder of Unsloth), and 4 more.

open-instruct by allenai (3k stars, top 0.2%)
Training codebase for instruction-following language models
Created 2 years ago, updated 16 hours ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 9 more.

alpaca-lora by tloen (19k stars, top 0.0%)
LoRA fine-tuning for LLaMA
Created 2 years ago, updated 1 year ago