GRPO implementation for reproducing LLM reasoning, in the style of R1
Top 32.7% on sourcepulse
This repository provides a simplified implementation of the GRPO (Group Relative Policy Optimization) algorithm for Large Language Models, targeting researchers and engineers who want to understand and experiment with RLHF (Reinforcement Learning from Human Feedback) concepts. It aims to reduce GPU memory usage and facilitate rapid iteration on RL training parameters and techniques.
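As a rough illustration of the core idea (a sketch, not code from this repository), GRPO samples a group of responses per prompt and scores each response against the statistics of its own group; the function name, tensor shapes, and epsilon below are assumptions made for the example.

import torch

# Sketch of GRPO's group-relative advantage: each sampled response is scored
# relative to the mean/std of rewards within its own group of samples.
def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar reward per sampled response."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: prompt 1 has two correct answers (reward 1), prompt 2 has none.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 0.0]])
print(group_relative_advantages(rewards))
# A group whose rewards are all identical collapses to zero advantage,
# which is related to the group-imbalance issue noted under Limitations below.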
How It Works
The implementation leverages Hugging Face's trl library for the core loss formula. A key architectural choice is the decoupling of the reference model, allowing it to run on separate GPUs or machines. This significantly reduces memory overhead on the training GPU, enabling the training of larger models (e.g., 7B parameters) on more accessible hardware. The project also incorporates optimizations such as Triton kernels for the loss calculation and vLLM for accelerated inference.
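A minimal sketch of how that decoupling fits into the objective: the trainer only needs per-token log-probabilities from the reference policy, so these can be produced by a separate process and shipped back as tensors. The clipping and KL coefficients below are illustrative defaults, not the repository's settings, and the actual loss here comes from trl (partly via Triton).

import torch

# Sketch of a GRPO-style per-token objective: a PPO-style clipped
# policy-gradient term plus a KL penalty toward the reference model.
# ref_logps can be computed by a detached reference process.
def grpo_token_loss(policy_logps, old_logps, ref_logps, advantages,
                    clip_eps: float = 0.2, kl_coef: float = 0.04):
    """Log-prob tensors are (batch, seq_len) for the sampled tokens;
    advantages is (batch, 1), broadcast over the sequence dimension."""
    ratio = torch.exp(policy_logps - old_logps)
    pg = -torch.minimum(ratio * advantages,
                        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    # "k3" estimator of KL(policy || reference), computed from log-probs only.
    kl = torch.exp(ref_logps - policy_logps) - (ref_logps - policy_logps) - 1
    return (pg + kl_coef * kl).mean()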
Quick Start & Requirements
Install the dependencies (pip install -r requirements.txt), then launch the reference model server on a dedicated GPU (CUDA_VISIBLE_DEVICES=7 python ref_server.py) and the training process on the remaining GPUs (CUDA_VISIBLE_DEVICES=2,3,4,5,6 deepspeed grpo_vllm_one.py).
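To make the split between the two processes concrete, the training side only needs to exchange token ids and log-probabilities with ref_server.py. The endpoint name, port, and payload format below are purely hypothetical placeholders for illustration, not the repository's actual interface.

import requests

# Hypothetical client showing the trainer <-> reference-server pattern:
# send sampled token ids, receive per-token reference log-probs.
def fetch_ref_logprobs(token_ids, server_url="http://localhost:59875"):
    """token_ids: list of token-id lists, one per sampled response."""
    resp = requests.post(f"{server_url}/ref_logprobs", json={"token_ids": token_ids})
    resp.raise_for_status()
    return resp.json()["logprobs"]  # one list of floats per response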
Highlighted Details
Maintenance & Community
The project is led by researchers from Fudan University's KnowledgeWorks Lab. Core development is handled by Ph.D. and Master's students. Community channels like Discord/Slack are not mentioned.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README, so licensing should be clarified before commercial use or closed-source integration.
Limitations & Caveats
The project is described as "simple" and "experimental." Known limitations include potentially invalid answers arising from group imbalances, as well as tight GPU memory requirements when generating long-context outputs; the team is actively addressing both.