Stabilizing LLM reasoning with self-supervised RL
Top 88.4% on SourcePulse
M-GRPO tackles the critical challenge of instability, termed "policy collapse," in self-supervised reinforcement learning (RL) for Large Language Models (LLMs). It offers a solution for researchers and engineers aiming to enhance LLM reasoning capabilities using RL without relying on costly human-annotated data, providing a more stable and effective training paradigm.
How It Works
M-GRPO stabilizes self-supervised RL for LLMs by anchoring training to a momentum model. It targets "policy collapse," where training becomes unstable because self-rewarding systems lack a consistent target signal. The core idea is a momentum model that evolves slowly and so serves as a stable reference point. This momentum-anchored signal is used to generate pseudo-labels, which then guide policy optimization. Decoupling the learning target from the rapidly changing policy is meant to prevent catastrophic performance degradation and keep the training trajectory stable.
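The description above can be illustrated with a small sketch. It is not taken from the M-GRPO code: the exponential-moving-average update, the majority-vote labeling rule, the binary reward, and every function and parameter name here (ema_update, pseudo_label, rewards, decay) are assumptions chosen only to show how a slowly evolving reference model could supply pseudo-labels to a faster-moving policy.

```python
from collections import Counter


def ema_update(momentum_params, policy_params, decay=0.99):
    """Move each momentum parameter a small step toward the policy's value.

    Assumption: the momentum model is an exponential moving average (EMA)
    of the policy. A decay close to 1 makes it evolve slowly.
    """
    return {
        name: decay * momentum_params[name] + (1.0 - decay) * policy_params[name]
        for name in momentum_params
    }


def pseudo_label(policy_answers, momentum_answers):
    """Pick a pseudo-label by majority vote over both models' sampled answers.

    Assumption: pooling the votes lets the slower momentum model anchor
    the label; the actual labeling rule may differ.
    """
    votes = Counter(policy_answers + momentum_answers)
    return votes.most_common(1)[0][0]


def rewards(policy_answers, label):
    """Assumed binary reward: 1.0 if an answer matches the pseudo-label."""
    return [1.0 if answer == label else 0.0 for answer in policy_answers]


if __name__ == "__main__":
    # Toy one-parameter "models" and hand-written sampled answers.
    momentum = ema_update({"w": 0.0}, {"w": 1.0}, decay=0.9)
    policy_samples = ["42", "41", "42"]
    momentum_samples = ["42", "42", "40"]
    label = pseudo_label(policy_samples, momentum_samples)
    print(momentum, label, rewards(policy_samples, label))
```

In this sketch the policy can change sharply between steps while the momentum parameters move only a fraction (1 - decay) of the way toward it, so the pseudo-label depends less on any single policy update.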
Quick Start & Requirements
python examples/data_preprocess/math_dataset_ours.py --model Qwen2.5-3B
bash math_intuitor.sh
(Note: Requires setting your WANDB API key in math_intuitor.sh.)
Related repositories: intuitor, open-r1, and verl. The Qwen2.5-3B model is specified for data preprocessing.
Highlighted Details
Maintenance & Community
The provided README does not contain specific details regarding notable contributors, sponsorships, community channels (e.g., Discord, Slack), or a public roadmap.
Licensing & Compatibility
Limitations & Caveats
The README does not explicitly detail any limitations, unsupported platforms, alpha status, or known bugs. The setup process requires specific data preparation steps and the use of a Weights & Biases (WANDB) API key for training.