SDPO by lasgroup

Self-distillation framework for reinforcement learning with rich feedback

Created 1 month ago
484 stars

Top 63.5% on SourcePulse

View on GitHub
Project Summary

SDPO addresses the credit-assignment bottleneck in reinforcement learning for verifiable domains by leveraging rich textual feedback. Targeting researchers and practitioners working with LLMs, it enables denser learning signals and improved performance by converting feedback into a self-distillation process, reducing reliance on sparse scalar rewards.

How It Works

SDPO formalizes Reinforcement Learning with Rich Feedback (RLRF). Its core innovation is self-distillation, where the model acts as its own teacher. It distills feedback-informed next-token predictions back into the policy, enabling the model to retrospectively identify and correct mistakes. This approach can also reuse high-reward trajectories as implicit feedback when rich environment data is scarce, leading to denser supervision and more stable training.
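The core idea can be sketched as a distillation loss: the policy's next-token distribution conditioned on the rich feedback acts as the teacher, and the same policy without the feedback in context is the student. The following is a minimal NumPy illustration of that objective, not SDPO's actual implementation; the function name and array shapes are assumptions for exposition.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student), averaged over token positions.

    teacher_logits: next-token logits produced with the rich feedback
    in context (the model teaching itself); student_logits: logits for
    the same target tokens without the feedback. Shape: (seq_len, vocab).
    """
    p = softmax(teacher_logits)                    # feedback-informed teacher
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits) + 1e-12)
    return float((p * (log_p - log_q)).sum(axis=-1).mean())

# Toy check: identical predictions incur (near-)zero loss,
# so gradient pressure only arises where feedback changes the prediction.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
assert self_distillation_loss(logits, logits) < 1e-9
```

Minimizing this KL pushes the policy's unconditioned predictions toward what it would have predicted had it already seen the feedback, which is what yields the denser supervision described above.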

Quick Start & Requirements

Requires Linux, NVIDIA GPUs (CUDA compatible), and Python 3.12. Installation is recommended via Docker for HPC environments, with specific build instructions provided for NVIDIA GH200 clusters. Local installation involves setting up PyTorch (e.g., CUDA 12.4 compatible), installing core dependencies via requirements.txt, and then installing SDPO in editable mode (pip install -e .). Flash Attention 2 compilation is also required. Official documentation links are available for the paper, code, and Weights & Biases logs.
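The local installation steps described above might look like the following; these commands are illustrative, assuming a CUDA 12.4 environment, and the exact pins and order should be taken from the project's README.

```shell
# Illustrative only — consult the SDPO README for the authoritative commands.
# PyTorch build matching CUDA 12.4 (index URL is the standard PyTorch wheel index):
pip install torch --index-url https://download.pytorch.org/whl/cu124
# Core dependencies, then SDPO itself in editable mode:
pip install -r requirements.txt
pip install -e .
# Flash Attention 2 is compiled against the local toolchain:
pip install flash-attn --no-build-isolation
```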

Highlighted Details

  • Novel self-distillation framework converts tokenized feedback into dense learning signals without external reward models.
  • Enables test-time self-distillation for iterative output refinement at inference, boosting performance on hard reasoning tasks without retraining.
  • Achieves faster convergence and stable training by incorporating rich environment feedback (e.g., runtime errors) directly into the learning process.
  • Outperforms baseline RL methods such as GRPO on reasoning benchmarks, particularly when rich feedback is available.

Maintenance & Community

The project is based on the verl framework. Notable authors are listed in the citation. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The license type is not specified in the README, posing a significant adoption blocker. Support for the Blackwell architecture (RTX 50 series, B100/B200) is explicitly marked as not fully tested.

Limitations & Caveats

Blackwell GPU architecture support is experimental and not fully validated. The absence of a clearly stated license in the README is a critical omission for evaluating commercial or broader adoption potential.

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 6
Issues (30d): 17
Star History: 492 stars in the last 30 days

Explore Similar Projects

Starred by Vincent Weisser (Cofounder of Prime Intellect), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 7 more.

self-rewarding-lm-pytorch by lucidrains

0%
1k
Training framework for self-rewarding language models
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems") and Edward Sun (Research Scientist at Meta Superintelligence Lab).

Eureka by eureka-research

0.0%
3k
LLM-based reward design for reinforcement learning
Created 2 years ago
Updated 1 year ago