simple_GRPO by lsdefine

GRPO implementation for reproducing LLM reasoning, like r1

Created 7 months ago
1,327 stars

Top 30.2% on SourcePulse

Project Summary

This repository provides a simplified implementation of the GRPO (Group Relative Policy Optimization) algorithm for Large Language Models, targeting researchers and engineers who want to understand and experiment with RL fine-tuning concepts such as RLHF (Reinforcement Learning from Human Feedback). It aims to reduce GPU memory usage and facilitate rapid iteration on RL training parameters and techniques.

How It Works

The implementation leverages Hugging Face's trl library for its core loss calculation formula. A key architectural choice is the decoupling of the reference model, allowing it to run on separate GPUs or machines. This significantly reduces memory overhead on the training GPU, enabling the training of larger models (e.g., 7B parameters) on more accessible hardware. The project also incorporates optimizations like Triton for loss calculation and vLLM for accelerated inference.
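The core trick behind GRPO is replacing a learned value network with group-relative reward normalization: each sampled answer's advantage is computed against the other answers drawn for the same prompt. The following is an illustrative sketch of that normalization, not the repository's actual code; the function name and epsilon value are assumptions.

```python
import math

def group_relative_advantages(rewards, eps=1e-4):
    """Group-relative advantages, GRPO-style (illustrative sketch).

    rewards: list of scalar rewards, one per sampled answer for a single
    prompt. Each reward is normalized by the group's mean and std, so no
    separate value model is needed.
    """
    mean = sum(rewards) / len(rewards)
    # Sample standard deviation over the group (hypothetical choice of
    # unbiased estimator; implementations vary).
    var = sum((r - mean) ** 2 for r in rewards) / (len(rewards) - 1)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Two correct answers out of four sampled: correct ones get positive
# advantage, incorrect ones negative.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline comes from the group itself, a batch where every sample earns the same reward produces near-zero advantages, which is one reason group composition matters in practice.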

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Requires at least two GPUs.
  • Example usage involves running a reference model server on one GPU (CUDA_VISIBLE_DEVICES=7 python ref_server.py) and the training process on others (CUDA_VISIBLE_DEVICES=2,3,4,5,6 deepspeed grpo_vllm_one.py).
  • Official documentation and demo are not explicitly linked, but usage examples are provided.
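The two-process layout above means the trainer must fetch reference-model log-probabilities over the network rather than computing them locally. Here is a minimal, self-contained sketch of that round trip; the endpoint path, JSON payload format, and dummy log-prob values are all hypothetical, not the repository's actual protocol.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class RefHandler(BaseHTTPRequestHandler):
    """Stand-in for a reference-model server (hypothetical protocol)."""

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        token_ids = json.loads(self.rfile.read(length))["token_ids"]
        # A real server would run the frozen reference model on its own GPU;
        # we return a dummy log-prob per token to keep the sketch runnable.
        logprobs = [-1.0 for _ in token_ids]
        body = json.dumps({"ref_logprobs": logprobs}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

def get_ref_logprobs(url, token_ids):
    """Trainer side: post token ids, get reference log-probs back."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"token_ids": token_ids}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["ref_logprobs"]

server = HTTPServer(("127.0.0.1", 0), RefHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/ref_logprobs"
ref = get_ref_logprobs(url, [101, 7592, 102])  # one log-prob per token
server.shutdown()
```

The design choice being illustrated: because the reference model is frozen, its log-probs can be produced anywhere (another GPU or another machine) and shipped back as plain data, freeing the training GPUs to hold only the policy and optimizer state.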

Highlighted Details

  • Achieves rapid training times, with Qwen2.5-3B and 7B models showing an "Aha moment" within 30 steps on a single A800 GPU.
  • Codebase is intentionally kept simple (approx. 200 lines across 2 files) for ease of understanding and modification.
  • Supports experimental features like regrouping, KL penalty, and parameter tuning.
  • Includes a recent Triton implementation of the loss for potential speedups, plus the REINFORCE++ algorithm.

Maintenance & Community

The project is led by researchers from Fudan University's KnowledgeWorks Lab. Core development is handled by Ph.D. and Master's students. Community channels like Discord/Slack are not mentioned.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. This requires further investigation for commercial use or closed-source integration.

Limitations & Caveats

The project is described as "simple" and "experimental." Known limitations include potential invalid answer generation due to group imbalances and tight GPU memory requirements for generating long context outputs, which the team is actively addressing.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 48 stars in the last 30 days

Explore Similar Projects

Starred by Vincent Weisser (Cofounder of Prime Intellect), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 4 more.

simpleRL-reason by hkust-nlp

  • Top 0.1% · 4k stars
  • RL recipe for reasoning ability in models
  • Created 7 months ago · Updated 1 month ago