GRPO-Zero by policy-gradient

Minimalist GRPO trainer for language models

created 3 months ago
1,501 stars

Top 28.0% on sourcepulse

View on GitHub
Project Summary

This repository implements the Group Relative Policy Optimization (GRPO) algorithm for training large language models with reinforcement learning, specifically targeting minimal dependencies and low GPU memory usage. It's designed for researchers and practitioners looking to fine-tune LLMs using RL without the overhead of complex frameworks like transformers or vLLM.

How It Works

GRPO trains LLMs by sampling multiple answers for each question and defining each answer's advantage as its reward normalized against the other answers in the group, eliminating the need for a value estimation network. The implementation features a token-level policy gradient loss, removes the KL divergence term (and thus the reference policy network) to save memory, and offers optional overlong episode filtering for training stability. The core update uses a PPO surrogate objective, simplified to a vanilla per-token policy gradient estimate.
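For intuition, here is a minimal sketch of the group-normalized advantage and the per-token surrogate loss. This is illustrative only, not the repository's actual code; the tensor shapes, function names, and the eps constant are assumptions:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: (G,) scalar reward for each of the G answers sampled for
    # one question. Normalizing within the group replaces the value
    # network: above-average answers get positive advantages.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def token_pg_loss(logprobs: torch.Tensor, mask: torch.Tensor,
                  advantages: torch.Tensor) -> torch.Tensor:
    # logprobs: (G, T) log-probability of each generated token under the
    # current policy; mask: (G, T) zeroes out prompt/padding tokens.
    # Every token is weighted by its episode's advantage, and the sum is
    # divided by the total token count: the token-level loss variant.
    per_token = -advantages.unsqueeze(-1) * logprobs
    return (per_token * mask).sum() / mask.sum()
```

Because advantages are centered within each group, answers that beat the group average are reinforced and the rest are suppressed, with no learned baseline required.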

Quick Start & Requirements

  • Install: uv sync
  • Prerequisites: git-lfs, Python (via uv), PyTorch.
  • Setup: Requires cloning datasets and pretrained models (e.g., Qwen2.5-3B-Instruct).
  • Training: uv run train.py or uv run train.py --config config_24GB.yaml for 24GB VRAM.
  • Links: Hugging Face Datasets, Qwen2.5-3B-Instruct

Highlighted Details

  • Minimal dependencies: Only relies on tokenizers and PyTorch.
  • Low VRAM usage: Configured for a single A40 (48GB) or RTX 4090 (24GB) with CPU offloading.
  • GRPO enhancements: Token-level loss, KL divergence removal, overlong episode filtering.
  • Example task: Fine-tuning Qwen2.5 on the CountDown task with a specific reward structure (a sketch follows this list).
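As a hedged illustration of what a CountDown-style reward might look like (the repository's exact reward values, answer tags, and parsing are not reproduced here; the <answer> tag and the 0/1 scoring are assumptions):

```python
import re
from collections import Counter

def countdown_reward(completion: str, numbers: list[int], target: int) -> float:
    # Toy CountDown reward: the completion must contain an arithmetic
    # expression that uses each given number exactly once and evaluates
    # to the target. Values and format are illustrative assumptions.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    expr = match.group(1).strip()
    # Only digits, whitespace, and basic arithmetic operators are allowed.
    if not re.fullmatch(r"[\d\s+\-*/()]+", expr):
        return 0.0
    # Each provided number must appear exactly once.
    if Counter(int(n) for n in re.findall(r"\d+", expr)) != Counter(numbers):
        return 0.0
    try:
        return 1.0 if abs(eval(expr) - target) < 1e-6 else 0.0
    except Exception:
        return 0.0
```

For example, countdown_reward("<answer>(95 - 45) * 2</answer>", [95, 45, 2], 100) returns 1.0. A scalar reward like this is exactly what feeds into the group normalization sketched above.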

Maintenance & Community

The project acknowledges DeepSeekMath, DAPO, TinyZero, and nano-aha-moment as prior work. No community links or roadmap are provided in the README.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as a from-scratch implementation, so it may harbor undiscovered bugs or lack features found in more mature libraries. Overlong episode filtering is disabled by default.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 242 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

HALOs by ContextualAI

Library for aligning LLMs using human-aware loss functions

created 1 year ago
updated 2 weeks ago
873 stars

Top 0.2% on sourcepulse