M_GRPO by baibizhe

Stabilizing LLM reasoning with self-supervised RL

Created 1 month ago
301 stars

Top 88.4% on SourcePulse

View on GitHub
Project Summary

M-GRPO tackles "policy collapse," the instability that plagues self-supervised reinforcement learning (RL) for Large Language Models (LLMs). It is aimed at researchers and engineers who want to improve LLM reasoning with RL without relying on costly human-annotated data, offering a more stable and effective training paradigm.

How It Works

M-GRPO stabilizes self-supervised RL for LLMs with a momentum-anchored design. In self-rewarding systems the reward and the policy come from the same rapidly updating model, so the learning target drifts with every step and training can collapse. M-GRPO instead maintains a momentum model that evolves slowly and serves as a stable reference point: it produces the pseudo-labels that guide policy optimization. Anchoring the learning target to this slowly moving model decouples it from the fast-changing policy, preventing catastrophic performance degradation and yielding a more robust training trajectory.
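The README does not include pseudocode, but the description above maps onto a familiar momentum/EMA pattern. The sketch below is a minimal illustration of that pattern, not the repository's implementation: it assumes PyTorch and hypothetical helpers (sample_group, score_responses, logprob_of) standing in for response sampling, self-rewarding, and log-probability computation.

```python
# Minimal sketch (not the repository's code) of momentum-anchored self-supervised RL:
# a slowly-evolving EMA copy of the policy scores sampled responses, its scores act
# as pseudo-labels, and GRPO-style group-relative advantages are computed from them.
from typing import Callable, List

import torch
import torch.nn as nn


@torch.no_grad()
def ema_update(momentum_model: nn.Module, policy: nn.Module, tau: float = 0.999) -> None:
    """Slowly move the momentum (anchor) model toward the current policy."""
    for m_param, p_param in zip(momentum_model.parameters(), policy.parameters()):
        m_param.mul_(tau).add_(p_param, alpha=1.0 - tau)


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style normalization: each sample's advantage relative to its group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def momentum_anchored_step(
    policy: nn.Module,                   # model being optimized
    momentum_model: nn.Module,           # starts as a copy of the policy, updated by EMA only
    optimizer: torch.optim.Optimizer,
    sample_group: Callable[[nn.Module], List[str]],                   # draws G responses from the policy
    score_responses: Callable[[nn.Module, List[str]], torch.Tensor],  # self-reward from a given model
    logprob_of: Callable[[nn.Module, List[str]], torch.Tensor],       # summed token log-probs per response
    tau: float = 0.999,
) -> None:
    """One illustrative update: pseudo-labels come from the momentum model, not the policy."""
    responses = sample_group(policy)

    # Anchor the reward signal on the slowly-moving momentum model so the target
    # does not chase the rapidly changing policy (the source of policy collapse).
    with torch.no_grad():
        pseudo_rewards = score_responses(momentum_model, responses)
        advantages = group_relative_advantages(pseudo_rewards)

    # Simple policy-gradient surrogate weighted by the anchored advantages.
    log_probs = logprob_of(policy, responses)
    loss = -(advantages * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The anchor drifts only slowly toward the policy.
    ema_update(momentum_model, policy, tau)
```

In this framing, tau controls how strongly the target is decoupled from the policy: the closer it is to 1, the more slowly the anchor drifts and the less the pseudo-labels chase the model they are supposed to supervise.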

Quick Start & Requirements

  1. Prepare the MATH dataset using the provided script: python examples/data_preprocess/math_dataset_ours.py --model Qwen2.5-3B
  2. Start training by running the bash script: bash math_intuitor.sh (note: you must set your WANDB API key in math_intuitor.sh before running).
  • Prerequisites: MATH dataset, WANDB API key.
  • Dependencies: The project builds on the intuitor, open-r1, and verl repositories. The Qwen2.5-3B model is used for data preprocessing.
  • Links: No direct links to official quick-start guides, documentation, or demos are provided in the README.

Highlighted Details

  • Successfully mitigates "policy collapse," a common instability in self-supervised RL for LLMs, ensuring stable training dynamics.
  • Empirically validated on the challenging MATH dataset, demonstrating consistently high validation accuracy and stable training rewards.
  • Provides a robust framework for enhancing LLM reasoning capabilities without requiring expensive human-annotated datasets.

Maintenance & Community

The provided README does not contain specific details regarding notable contributors, sponsorships, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: The Apache License 2.0 is permissive and generally suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README does not explicitly list any limitations, unsupported platforms, alpha status, or known bugs. Setup does, however, require the data-preparation step above and a Weights & Biases (WANDB) API key for training.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 257 stars in the last 30 days
