Minimalist GRPO trainer for language models
This repository implements the Group Relative Policy Optimization (GRPO) algorithm for training large language models with reinforcement learning, targeting minimal dependencies and low GPU memory usage. It is designed for researchers and practitioners who want to fine-tune LLMs with RL without the overhead of heavier frameworks such as transformers or vLLM.
How It Works
GRPO trains LLMs by sampling multiple answers for each question and defining each answer's advantage as its reward normalized within that group, eliminating the need for a value estimation network. The implementation uses a token-level policy gradient loss, drops the KL divergence term (and with it the reference policy network) to save memory, and offers optional overlong episode filtering for training stability. The core update uses the PPO surrogate objective, simplified to a per-token vanilla policy gradient estimate.
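The sketch below illustrates these two ingredients under illustrative assumptions: tensor names, shapes, and the function signatures (group_relative_advantages, token_level_pg_loss, mask) are not the repository's actual API, just a minimal rendering of group-normalized advantages and a per-token clipped surrogate that reduces to vanilla policy gradient when old and new policies coincide.

```python
# Minimal sketch of group-relative advantages and a token-level PPO-style loss.
# Shapes and names are assumptions for illustration, not the repo's API.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_questions, group_size), one scalar reward per sampled answer.
    Each answer's advantage is its reward normalized within its group,
    so no value network is required."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def token_level_pg_loss(logprobs: torch.Tensor,
                        old_logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        mask: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """PPO surrogate applied per response token; with a single update per batch
    the ratio is ~1 and this reduces to a vanilla policy gradient estimate.
    logprobs, old_logprobs, mask: (batch, seq_len); advantages: (batch,),
    broadcast to every token of the response. No KL term, no reference policy."""
    ratio = torch.exp(logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped)
    # Average over all response tokens in the batch (token-level loss).
    return (per_token * mask).sum() / mask.sum().clamp_min(1.0)
```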
Quick Start & Requirements
Requirements: git-lfs, Python (via uv), and PyTorch. Install dependencies with uv sync, then launch training with uv run train.py, or uv run train.py --config config_24GB.yaml for GPUs with 24 GB of VRAM.
Highlighted Details
Maintenance & Community
The project acknowledges contributions from DeepSeekMath, DAPO, TinyZero, and nano-aha-moment. No specific community links or roadmap are provided in the README.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is presented as an implementation from scratch, suggesting potential for undiscovered bugs or missing features compared to more mature libraries. The "overlong episode filtering" is disabled by default.
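As a hedged illustration only (the repository's actual implementation and parameter names may differ), overlong episode filtering is typically a mask that excludes responses truncated at the generation limit from the loss, since their rewards tend to be noisy; response_lengths and max_new_tokens below are assumed names.

```python
import torch

def overlong_mask(response_lengths: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """(batch,) float mask: 1.0 for responses that finished naturally,
    0.0 for those that hit the length cap, so truncated episodes
    contribute nothing to the policy gradient loss."""
    return (response_lengths < max_new_tokens).float()
```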