GRPO implementation for reproducing LLM reasoning, in the style of R1
Top 32.7% on sourcepulse
This repository provides a simplified implementation of the GRPO (Group Relative Policy Optimization) algorithm for Large Language Models, targeting researchers and engineers who want to understand and experiment with RLHF (Reinforcement Learning from Human Feedback) concepts. It aims to reduce GPU memory usage and facilitate rapid iteration on RL training parameters and techniques.
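As a rough illustration of the core idea (a sketch, not code from this repository), GRPO samples a group of responses per prompt and scores each response against the statistics of its own group; the function name, tensor shapes, and epsilon below are assumptions made for the example.

import torch

# Sketch of GRPO's group-relative advantage: each sampled response is scored
# relative to the mean/std of rewards within its own group of samples.
def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar reward per sampled response."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: prompt 1 has two correct answers (reward 1), prompt 2 has none.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 0.0]])
print(group_relative_advantages(rewards))
# A group whose rewards are all identical collapses to zero advantage,
# which is related to the group-imbalance issue noted under Limitations below.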
How It Works
The implementation leverages Hugging Face's trl library for the core loss formula. A key architectural choice is the decoupling of the reference model, allowing it to run on separate GPUs or machines. This significantly reduces memory overhead on the training GPU, enabling the training of larger models (e.g., 7B parameters) on more accessible hardware. The project also incorporates optimizations such as Triton kernels for the loss calculation and vLLM for accelerated inference.
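A minimal sketch of how that decoupling fits into the objective: the trainer only needs per-token log-probabilities from the reference policy, so these can be produced by a separate process and shipped back as tensors. The clipping and KL coefficients below are illustrative defaults, not the repository's settings, and the actual loss here comes from trl (partly via Triton).

import torch

# Sketch of a GRPO-style per-token objective: a PPO-style clipped
# policy-gradient term plus a KL penalty toward the reference model.
# ref_logps can be computed by a detached reference process.
def grpo_token_loss(policy_logps, old_logps, ref_logps, advantages,
                    clip_eps: float = 0.2, kl_coef: float = 0.04):
    """Log-prob tensors are (batch, seq_len) for the sampled tokens;
    advantages is (batch, 1), broadcast over the sequence dimension."""
    ratio = torch.exp(policy_logps - old_logps)
    pg = -torch.minimum(ratio * advantages,
                        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    # "k3" estimator of KL(policy || reference), computed from log-probs only.
    kl = torch.exp(ref_logps - policy_logps) - (ref_logps - policy_logps) - 1
    return (pg + kl_coef * kl).mean()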
Quick Start & Requirements
Install the dependencies (pip install -r requirements.txt), then launch the reference model server on a dedicated GPU (CUDA_VISIBLE_DEVICES=7 python ref_server.py) and the training process on the remaining GPUs (CUDA_VISIBLE_DEVICES=2,3,4,5,6 deepspeed grpo_vllm_one.py).
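To make the split between the two processes concrete, the training side only needs to exchange token ids and log-probabilities with ref_server.py. The endpoint name, port, and payload format below are purely hypothetical placeholders for illustration, not the repository's actual interface.

import requests

# Hypothetical client showing the trainer <-> reference-server pattern:
# send sampled token ids, receive per-token reference log-probs.
def fetch_ref_logprobs(token_ids, server_url="http://localhost:59875"):
    """token_ids: list of token-id lists, one per sampled response."""
    resp = requests.post(f"{server_url}/ref_logprobs", json={"token_ids": token_ids})
    resp.raise_for_status()
    return resp.json()["logprobs"]  # one list of floats per response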
Highlighted Details
Maintenance & Community
The project is led by researchers from Fudan University's KnowledgeWorks Lab. Core development is handled by Ph.D. and Master's students. Community channels like Discord/Slack are not mentioned.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README, so licensing should be clarified before commercial use or closed-source integration.
Limitations & Caveats
The project is described as "simple" and "experimental." Known limitations include potentially invalid answers arising from group imbalances, as well as tight GPU memory requirements when generating long-context outputs; the team is actively addressing both.