GDPO by NVlabs

Reinforcement learning optimization for multi-reward tasks

Created 1 month ago
318 stars

Top 85.5% on SourcePulse

View on GitHub
Project Summary

GDPO is a novel reinforcement learning optimization method designed to address the reward advantage collapse issue that arises in multi-reward training, particularly with Group Relative Policy Optimization (GRPO). It targets researchers and practitioners working on complex RL tasks with multiple, potentially conflicting reward signals, offering improved training stability and downstream performance.

How It Works

GDPO's core innovation is decoupling the normalization of individual rewards, rather than applying a single normalization across all rewards as GRPO does. When rewards are summed before group normalization, for instance, responses scoring (1, 0) and (0, 1) on two rewards receive identical advantages despite encoding different trade-offs; normalizing each reward independently preserves these relative differences and yields a more accurate, higher-resolution training signal. This enables more faithful preference optimization, more robust convergence, and better performance on tasks requiring nuanced reward interpretation. Implementations are provided for popular RL frameworks including VERL, TRL, and Nemo-RL.
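
A minimal sketch of the difference in plain NumPy, assuming group-wise mean/std normalization and sum-based aggregation (the aggregation rule, shapes, and variable names are illustrative, not the repository's actual API):

```python
import numpy as np

# Sketch: GRPO-style vs. GDPO-style advantage computation for a group
# of G sampled responses scored by K reward functions.
rng = np.random.default_rng(0)
G, K = 8, 3                        # group size, number of reward signals
rewards = rng.normal(size=(G, K))  # rewards[i, k]: reward k for response i

# GRPO-style (assumed here): aggregate first, then normalize the scalar
# within the group. Responses whose rewards trade off against each other
# can end up with identical sums, and therefore identical advantages
# ("reward advantage collapse").
agg = rewards.sum(axis=1)                          # (G,)
adv_grpo = (agg - agg.mean()) / (agg.std() + 1e-8)

# GDPO-style (as described above): normalize each reward independently
# across the group, then aggregate the per-reward advantages, so relative
# differences within each reward signal survive aggregation.
per_reward = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
adv_gdpo = per_reward.sum(axis=1)                  # (G,)

print("GRPO advantages:", adv_grpo)
print("GDPO advantages:", adv_gdpo)
```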

Quick Start & Requirements

  • No installation steps or dependency requirements were detailed in the provided text; usage is via the VERL, TRL, and Nemo-RL integrations noted below.

Highlighted Details

  • GDPO consistently surpasses GRPO in both training convergence and downstream evaluation performance across tool calling, math reasoning, and code generation tasks.
  • Demonstrated effectiveness on Qwen2.5-1.5B-Instruct for tool calling (4k samples, 100 steps) and math reasoning (GSM8K dataset, 1 epoch).
  • Positioned as a straightforward drop-in replacement for GRPO within the TRL and VERL frameworks (see the sketch below).
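
Assuming GDPO exposes a trainer mirroring TRL's GRPOTrainer interface (the GDPOTrainer name and gdpo module below are hypothetical, not verified against the repository), the drop-in swap might look like this. The TRL classes and the trl-lib/tldr dataset are real; the toy reward functions are for illustration only:

```python
# Hypothetical sketch: swapping GRPO for GDPO in a TRL training script.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer  # baseline GRPO trainer
# from gdpo import GDPOTrainer           # hypothetical drop-in replacement

def length_reward(completions, **kwargs):
    # Toy reward signal 1: prefer shorter completions.
    return [-float(len(c)) for c in completions]

def keyword_reward(completions, **kwargs):
    # Toy reward signal 2: reward completions that give a justification.
    return [1.0 if "because" in c else 0.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")
args = GRPOConfig(output_dir="gdpo-demo", num_generations=8)

trainer = GRPOTrainer(            # replace with GDPOTrainer to use GDPO
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=[length_reward, keyword_reward],  # multiple reward signals
    args=args,
    train_dataset=dataset,
)
trainer.train()
```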

Maintenance & Community

  • The project is copyrighted by NVIDIA Corporation. No community channels (e.g., Discord, Slack) or public roadmap are detailed in the provided text.

Licensing & Compatibility

  • License type: NVIDIA Source Code License-NC.
  • Compatibility notes: The non-commercial (NC) clause restricts usage to non-commercial applications. Compatibility with closed-source projects may be limited by this license.

Limitations & Caveats

  • The primary limitation is the NVIDIA Source Code License-NC, which prohibits commercial use.
  • Implementations are tied to specific frameworks (VERL, TRL, Nemo-RL), potentially requiring integration effort if not already using these.
  • The associated paper is an arXiv preprint, indicating it may not have undergone formal peer review.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4

Star History

  • 321 stars in the last 30 days

Explore Similar Projects

ARS by modestyachts

  • Reinforcement learning via augmented random search
  • 0% · 427 stars · Created 7 years ago · Updated 4 years ago
  • Starred by Philipp Moritz (Cofounder of Anyscale), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 1 more.

ROLL by alibaba

  • RL library for large language models
  • 1.6% · 3k stars · Created 8 months ago · Updated 1 day ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 3 more.

Eureka by eureka-research

  • LLM-based reward design for reinforcement learning
  • 0.1% · 3k stars · Created 2 years ago · Updated 1 year ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Edward Sun (Research Scientist at Meta Superintelligence Lab).