NVlabs / GDPO: Reinforcement learning optimization for multi-reward tasks
Top 85.5% on SourcePulse
GDPO is a novel reinforcement learning optimization method designed to address reward advantage collapse, an issue prevalent in multi-reward training, particularly when using Group Relative Policy Optimization (GRPO). It targets researchers and practitioners working on complex RL tasks with multiple, potentially conflicting reward signals, offering improved training stability and downstream performance.
How It Works
GDPO's core innovation lies in decoupling the normalization of individual rewards, rather than applying a single normalization across all rewards as GRPO does. This approach preserves the distinct relative differences between rewards, providing a more accurate and higher-resolution training signal. This allows for more faithful preference optimization, leading to more robust convergence and better performance on tasks requiring nuanced reward interpretation. Implementations are provided for popular RL frameworks like VERL, TRL, and Nemo-RL.
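To make the difference concrete, below is a minimal NumPy sketch contrasting a single group normalization over summed rewards (GRPO-style) with decoupled per-reward normalization. The function names, the equal-weight combination rule, and the toy reward values are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style sketch: sum the reward components per rollout, then apply
    one group normalization to the summed reward."""
    total = rewards.sum(axis=1)                          # (group_size,)
    return (total - total.mean()) / (total.std() + eps)

def decoupled_advantages(rewards: np.ndarray, weights=None, eps: float = 1e-6) -> np.ndarray:
    """Decoupled (GDPO-style) sketch: normalize each reward component across
    the group independently, then combine the per-reward advantages.
    The equal-weight combination here is an assumption."""
    if weights is None:
        weights = np.ones(rewards.shape[1]) / rewards.shape[1]
    per_reward = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
    return per_reward @ weights                          # (group_size,)

# Example: a group of 4 rollouts scored on 2 rewards (e.g. correctness, format).
group_rewards = np.array([
    [1.0, 0.2],
    [0.0, 0.9],
    [1.0, 0.8],
    [0.0, 0.1],
])
print("GRPO-style advantages:", grpo_advantages(group_rewards))
print("Decoupled advantages: ", decoupled_advantages(group_rewards))
```

Under a single joint normalization, one reward's scale can dominate the group statistics and wash out the others; normalizing each reward separately keeps every component's relative differences visible in the combined training signal.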
Quick Start & Requirements
The repository provides three implementations: verl-GDPO, trl-GDPO, and nemo_rl-GDPO. Each includes easy-to-use, Slurm-free training scripts.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats