LLM-RLHF-Tuning by Joyce94

LLM tuning via RLHF (SFT+RM+PPO+DPO) with LoRA

Created 2 years ago
436 stars

Top 68.3% on SourcePulse

View on GitHub
Project Summary

This project provides a comprehensive, from-scratch implementation of the Reinforcement Learning from Human Feedback (RLHF) training pipeline: Supervised Fine-Tuning (SFT), Reward Modeling (RM), and Proximal Policy Optimization (PPO), plus Direct Preference Optimization (DPO) as an RL-free alternative to the RM and PPO stages. It targets researchers and practitioners who want to fine-tune Large Language Models (LLMs) with these alignment techniques, and offers detailed implementation insights and flexible training configurations.
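
As background on the DPO stage: DPO trains the policy directly on preference pairs against a frozen reference (typically the SFT) model instead of running reinforcement learning. The function below is a generic, self-contained sketch of the standard DPO loss, not code taken from this repository.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        """Generic DPO loss sketch (Rafailov et al., 2023).

        Inputs are per-sequence log-probabilities of the chosen/rejected
        responses, summed over response tokens, under the trainable policy
        and the frozen reference model. `beta` scales the implicit KL
        penalty that keeps the policy close to the reference.
        """
        chosen_logratio = policy_chosen_logps - ref_chosen_logps
        rejected_logratio = policy_rejected_logps - ref_rejected_logps
        # Maximize the margin between chosen and rejected log-ratios.
        return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()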

How It Works

The framework leverages the PEFT library, specifically LoRA adapters, to enable efficient fine-tuning. It supports loading multiple models (base model, SFT, RM, Actor, Critic) simultaneously, allowing for complex training setups. The implementation emphasizes flexibility, offering various configurations for model loading and distributed training via accelerate and deepspeed, including scenarios with single or multiple LoRA adapters and shared base models.
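
As a rough illustration of the shared-base-model setup, the PEFT API lets multiple LoRA adapters be attached to one frozen base model and switched at runtime. The snippet below is a minimal sketch using assumed adapter paths and adapter names ("actor", "critic"), not the repository's actual loading code.

    import torch
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # Assumed model name and adapter paths, for illustration only.
    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
    )

    # Attach the actor (policy) LoRA adapter, then load a second adapter onto
    # the same frozen base so actor and critic share one copy of the base weights.
    model = PeftModel.from_pretrained(base, "output/actor_lora", adapter_name="actor")
    model.load_adapter("output/critic_lora", adapter_name="critic")

    model.set_adapter("actor")   # forward passes now route through the actor adapter
    # ... sample rollouts ...
    model.set_adapter("critic")  # switch to the critic adapter for value estimates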

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.10+, PyTorch 2.0.1+, accelerate==0.21.0, datasets==2.13.1, scikit-learn==1.3.0, sentencepiece==0.1.99, tqdm==4.65.0, transformers==4.31.0, wandb==0.15.8, peft==0.4.0, trl==0.5.0, deepspeed==0.10.0.
  • Supported Models: LLaMA, LLaMA2.
  • Documentation: Training guides for SFT, RM, PPO, and DPO are available.

Highlighted Details

  • Full RLHF pipeline: SFT, RM, PPO, and DPO.
  • Supports LLaMA2 and DPO training.
  • Flexible distributed training with accelerate and deepspeed.
  • Efficient training via LoRA adapters.
  • Detailed implementation notes for PPO (see the sketch after this list).
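
On the last point: the heart of a PPO trainer is the clipped surrogate policy loss computed over response tokens. The function below is a generic sketch of that objective with a padding mask, not an excerpt from this project's implementation.

    import torch

    def ppo_policy_loss(logprobs: torch.Tensor,      # new per-token log-probs, (batch, seq)
                        old_logprobs: torch.Tensor,  # log-probs recorded at rollout time
                        advantages: torch.Tensor,    # advantage estimates (e.g. GAE), (batch, seq)
                        mask: torch.Tensor,          # 1 for response tokens, 0 for padding
                        clip_range: float = 0.2) -> torch.Tensor:
        """Generic PPO clipped surrogate loss averaged over response tokens."""
        ratio = torch.exp(logprobs - old_logprobs)
        unclipped = -advantages * ratio
        clipped = -advantages * torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)
        # Pessimistic (element-wise max) combination, averaged over non-padding tokens.
        loss = torch.max(unclipped, clipped)
        return (loss * mask).sum() / mask.sum()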

Maintenance & Community

The most recent updates added LLaMA2 and DPO support, but the repository has since gone quiet (last commit roughly two years ago; see the Health Check below). A WeChat group is available for discussion.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Several planned features remain on the TODO list, including improving PPO training stability, implementing PPO-max, DDPO, RRHF, and RAFT, and adding support for additional models such as BLOOM and Baichuan. QLoRA training is also noted as a future consideration.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

3 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Travis Fischer (Founder of Agentic), and 6 more.

picotron by huggingface

0.7%
2k
Minimalist distributed training framework for educational use
Created 1 year ago
Updated 2 months ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Jiayi Pan (Author of SWE-Gym; MTS at xAI).

Pai-Megatron-Patch by alibaba

0.6%
1k
Training toolkit for LLMs & VLMs using Megatron
Created 2 years ago
Updated 4 days ago
Starred by Jiaming Song (Chief Scientist at Luma AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

LLaMA-Adapter by OpenGVLab

0.0%
6k
Efficient fine-tuning for instruction-following LLaMA models
Created 2 years ago
Updated 1 year ago
Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Vincent Weisser (Cofounder of Prime Intellect), and 25 more.

alpaca-lora by tloen

0.0%
19k
LoRA fine-tuning for LLaMA
Created 2 years ago
Updated 1 year ago