Minimalist GRPO trainer for language models
This repository implements the Group Relative Policy Optimization (GRPO) algorithm for training large language models with reinforcement learning, targeting minimal dependencies and low GPU memory usage. It is designed for researchers and practitioners who want to fine-tune LLMs with RL without the overhead of heavier frameworks such as transformers or vLLM.
How It Works
GRPO trains LLMs by sampling multiple answers for each question and defining each answer's advantage as its reward normalized within that group, eliminating the need for a value estimation network. The implementation uses a token-level policy gradient loss, drops the KL divergence term (and with it the reference policy network) to save memory, and offers optional overlong episode filtering for training stability. The core update uses the PPO surrogate objective, simplified to a per-token vanilla policy gradient estimate.
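The sketch below illustrates these two ingredients under illustrative assumptions: tensor names, shapes, and the function signatures (group_relative_advantages, token_level_pg_loss, mask) are not the repository's actual API, just a minimal rendering of group-normalized advantages and a per-token clipped surrogate that reduces to vanilla policy gradient when old and new policies coincide.

```python
# Minimal sketch of group-relative advantages and a token-level PPO-style loss.
# Shapes and names are assumptions for illustration, not the repo's API.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_questions, group_size), one scalar reward per sampled answer.
    Each answer's advantage is its reward normalized within its group,
    so no value network is required."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def token_level_pg_loss(logprobs: torch.Tensor,
                        old_logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        mask: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """PPO surrogate applied per response token; with a single update per batch
    the ratio is ~1 and this reduces to a vanilla policy gradient estimate.
    logprobs, old_logprobs, mask: (batch, seq_len); advantages: (batch,),
    broadcast to every token of the response. No KL term, no reference policy."""
    ratio = torch.exp(logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped)
    # Average over all response tokens in the batch (token-level loss).
    return (per_token * mask).sum() / mask.sum().clamp_min(1.0)
```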
Quick Start & Requirements
Requirements: git-lfs, Python (via uv), and PyTorch. Install dependencies with uv sync, then launch training with uv run train.py, or uv run train.py --config config_24GB.yaml for GPUs with 24 GB of VRAM.
Highlighted Details
Maintenance & Community
The project acknowledges contributions from DeepSeekMath, DAPO, TinyZero, and nano-aha-moment. No specific community links or roadmap are provided in the README.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is presented as an implementation from scratch, suggesting potential for undiscovered bugs or missing features compared to more mature libraries. The "overlong episode filtering" is disabled by default.
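As a hedged illustration only (the repository's actual implementation and parameter names may differ), overlong episode filtering is typically a mask that excludes responses truncated at the generation limit from the loss, since their rewards tend to be noisy; response_lengths and max_new_tokens below are assumed names.

```python
import torch

def overlong_mask(response_lengths: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """(batch,) float mask: 1.0 for responses that finished naturally,
    0.0 for those that hit the length cap, so truncated episodes
    contribute nothing to the policy gradient loss."""
    return (response_lengths < max_new_tokens).float()
```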