M_GRPO by baibizhe

Stabilizing LLM reasoning with self-supervised RL

Created 1 month ago
301 stars

Top 88.4% on SourcePulse

View on GitHub
Project Summary

M-GRPO tackles "policy collapse," the instability that plagues self-supervised reinforcement learning (RL) for Large Language Models (LLMs). It is aimed at researchers and engineers who want to improve LLM reasoning with RL without relying on costly human-annotated data, offering a more stable and effective training paradigm.

How It Works

M-GRPO stabilizes self-supervised RL for LLMs with a momentum-anchored design. In self-rewarding systems the reward and the policy come from the same rapidly updating model, so the learning target drifts with every step and training can collapse. M-GRPO instead maintains a momentum model that evolves slowly and serves as a stable reference point: it produces the pseudo-labels that guide policy optimization. Anchoring the learning target to this slowly moving model decouples it from the fast-changing policy, preventing catastrophic performance degradation and yielding a more robust training trajectory.
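The README does not include pseudocode, but the description above maps onto a familiar momentum/EMA pattern. The sketch below is a minimal illustration of that pattern, not the repository's implementation: it assumes PyTorch and hypothetical helpers (sample_group, score_responses, logprob_of) standing in for response sampling, self-rewarding, and log-probability computation.

```python
# Minimal sketch (not the repository's code) of momentum-anchored self-supervised RL:
# a slowly-evolving EMA copy of the policy scores sampled responses, its scores act
# as pseudo-labels, and GRPO-style group-relative advantages are computed from them.
from typing import Callable, List

import torch
import torch.nn as nn


@torch.no_grad()
def ema_update(momentum_model: nn.Module, policy: nn.Module, tau: float = 0.999) -> None:
    """Slowly move the momentum (anchor) model toward the current policy."""
    for m_param, p_param in zip(momentum_model.parameters(), policy.parameters()):
        m_param.mul_(tau).add_(p_param, alpha=1.0 - tau)


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style normalization: each sample's advantage relative to its group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def momentum_anchored_step(
    policy: nn.Module,                   # model being optimized
    momentum_model: nn.Module,           # starts as a copy of the policy, updated by EMA only
    optimizer: torch.optim.Optimizer,
    sample_group: Callable[[nn.Module], List[str]],                   # draws G responses from the policy
    score_responses: Callable[[nn.Module, List[str]], torch.Tensor],  # self-reward from a given model
    logprob_of: Callable[[nn.Module, List[str]], torch.Tensor],       # summed token log-probs per response
    tau: float = 0.999,
) -> None:
    """One illustrative update: pseudo-labels come from the momentum model, not the policy."""
    responses = sample_group(policy)

    # Anchor the reward signal on the slowly-moving momentum model so the target
    # does not chase the rapidly changing policy (the source of policy collapse).
    with torch.no_grad():
        pseudo_rewards = score_responses(momentum_model, responses)
        advantages = group_relative_advantages(pseudo_rewards)

    # Simple policy-gradient surrogate weighted by the anchored advantages.
    log_probs = logprob_of(policy, responses)
    loss = -(advantages * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The anchor drifts only slowly toward the policy.
    ema_update(momentum_model, policy, tau)
```

In this framing, tau controls how strongly the target is decoupled from the policy: the closer it is to 1, the more slowly the anchor drifts and the less the pseudo-labels chase the model they are supposed to supervise.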

Quick Start & Requirements

  1. Prepare the MATH dataset using the provided script: python examples/data_preprocess/math_dataset_ours.py --model Qwen2.5-3B
  2. Start training by running the bash script: bash math_intuitor.sh (note: you must set your WANDB API key in math_intuitor.sh before running).
  • Prerequisites: MATH dataset, WANDB API key.
  • Dependencies: The project builds on the intuitor, open-r1, and verl repositories. The Qwen2.5-3B model is used for data preprocessing.
  • Links: No direct links to official quick-start guides, documentation, or demos are provided in the README.

Highlighted Details

  • Successfully mitigates "policy collapse," a common instability in self-supervised RL for LLMs, ensuring stable training dynamics.
  • Empirically validated on the challenging MATH dataset, demonstrating consistently high validation accuracy and stable training rewards.
  • Provides a robust framework for enhancing LLM reasoning capabilities without requiring expensive human-annotated datasets.

Maintenance & Community

The provided README does not contain specific details regarding notable contributors, sponsorships, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: The Apache License 2.0 is permissive and generally suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README does not explicitly list any limitations, unsupported platforms, alpha status, or known bugs. Setup does, however, require the data-preparation step above and a Weights & Biases (WANDB) API key for training.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 257 stars in the last 30 days
