GDPO by NVlabs

Reinforcement learning optimization for multi-reward tasks

Created 1 month ago
318 stars

Top 85.5% on SourcePulse

View on GitHub
Project Summary

GDPO is a novel reinforcement learning optimization method designed to address the reward advantage collapse issue that arises in multi-reward training, particularly with Group Relative Policy Optimization (GRPO). It targets researchers and practitioners working on complex RL tasks with multiple, potentially conflicting reward signals, offering improved training stability and downstream performance.

How It Works

GDPO's core innovation is decoupling the normalization of individual rewards, rather than applying a single normalization across all rewards as GRPO does. When rewards are summed before group normalization, for instance, responses scoring (1, 0) and (0, 1) on two rewards receive identical advantages despite encoding different trade-offs; normalizing each reward independently preserves these relative differences and yields a more accurate, higher-resolution training signal. This enables more faithful preference optimization, more robust convergence, and better performance on tasks requiring nuanced reward interpretation. Implementations are provided for popular RL frameworks including VERL, TRL, and Nemo-RL.
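
A minimal sketch of the difference in plain NumPy, assuming group-wise mean/std normalization and sum-based aggregation (the aggregation rule, shapes, and variable names are illustrative, not the repository's actual API):

```python
import numpy as np

# Sketch: GRPO-style vs. GDPO-style advantage computation for a group
# of G sampled responses scored by K reward functions.
rng = np.random.default_rng(0)
G, K = 8, 3                        # group size, number of reward signals
rewards = rng.normal(size=(G, K))  # rewards[i, k]: reward k for response i

# GRPO-style (assumed here): aggregate first, then normalize the scalar
# within the group. Responses whose rewards trade off against each other
# can end up with identical sums, and therefore identical advantages
# ("reward advantage collapse").
agg = rewards.sum(axis=1)                          # (G,)
adv_grpo = (agg - agg.mean()) / (agg.std() + 1e-8)

# GDPO-style (as described above): normalize each reward independently
# across the group, then aggregate the per-reward advantages, so relative
# differences within each reward signal survive aggregation.
per_reward = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
adv_gdpo = per_reward.sum(axis=1)                  # (G,)

print("GRPO advantages:", adv_grpo)
print("GDPO advantages:", adv_gdpo)
```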

Quick Start & Requirements

  • No installation steps or dependency requirements were detailed in the provided text; usage is via the VERL, TRL, and Nemo-RL integrations noted below.

Highlighted Details

  • GDPO consistently surpasses GRPO in both training convergence and downstream evaluation performance across tool calling, math reasoning, and code generation tasks.
  • Demonstrated effectiveness on Qwen2.5-1.5B-Instruct for tool calling (4k samples, 100 steps) and math reasoning (GSM8K dataset, 1 epoch).
  • Positioned as a straightforward drop-in replacement for GRPO within the TRL and VERL frameworks (see the sketch below).
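
Assuming GDPO exposes a trainer mirroring TRL's GRPOTrainer interface (the GDPOTrainer name and gdpo module below are hypothetical, not verified against the repository), the drop-in swap might look like this. The TRL classes and the trl-lib/tldr dataset are real; the toy reward functions are for illustration only:

```python
# Hypothetical sketch: swapping GRPO for GDPO in a TRL training script.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer  # baseline GRPO trainer
# from gdpo import GDPOTrainer           # hypothetical drop-in replacement

def length_reward(completions, **kwargs):
    # Toy reward signal 1: prefer shorter completions.
    return [-float(len(c)) for c in completions]

def keyword_reward(completions, **kwargs):
    # Toy reward signal 2: reward completions that give a justification.
    return [1.0 if "because" in c else 0.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")
args = GRPOConfig(output_dir="gdpo-demo", num_generations=8)

trainer = GRPOTrainer(            # replace with GDPOTrainer to use GDPO
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=[length_reward, keyword_reward],  # multiple reward signals
    args=args,
    train_dataset=dataset,
)
trainer.train()
```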

Maintenance & Community

  • The project is copyrighted by NVIDIA Corporation. No community channels (e.g., Discord, Slack) or public roadmap are detailed in the provided text.

Licensing & Compatibility

  • License type: NVIDIA Source Code License-NC.
  • Compatibility notes: The non-commercial (NC) clause restricts usage to non-commercial applications. Compatibility with closed-source projects may be limited by this license.

Limitations & Caveats

  • The primary limitation is the NVIDIA Source Code License-NC, which prohibits commercial use.
  • Implementations are tied to specific frameworks (VERL, TRL, Nemo-RL), potentially requiring integration effort if not already using these.
  • The associated paper is an arXiv preprint, indicating it may not have undergone formal peer review.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4

Star History

  • 321 stars in the last 30 days

Explore Similar Projects

ARS by modestyachts

  • Reinforcement learning via augmented random search
  • 0% · 427 stars · Created 7 years ago · Updated 4 years ago
  • Starred by Philipp Moritz (Cofounder of Anyscale), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 1 more.

ROLL by alibaba

  • RL library for large language models
  • 1.6% · 3k stars · Created 8 months ago · Updated 1 day ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 3 more.

Eureka by eureka-research

  • LLM-based reward design for reinforcement learning
  • 0.1% · 3k stars · Created 2 years ago · Updated 1 year ago
  • Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Edward Sun (Research Scientist at Meta Superintelligence Lab).