SDPO by lasgroup

Self-distillation framework for reinforcement learning with rich feedback

Created 1 month ago
484 stars

Top 63.5% on SourcePulse

View on GitHub
Project Summary

SDPO addresses the credit-assignment bottleneck in reinforcement learning for verifiable domains by leveraging rich textual feedback. Targeting researchers and practitioners working with LLMs, it enables denser learning signals and improved performance by converting feedback into a self-distillation process, reducing reliance on sparse scalar rewards.

How It Works

SDPO formalizes Reinforcement Learning with Rich Feedback (RLRF). Its core innovation is self-distillation, where the model acts as its own teacher. It distills feedback-informed next-token predictions back into the policy, enabling the model to retrospectively identify and correct mistakes. This approach can also reuse high-reward trajectories as implicit feedback when rich environment data is scarce, leading to denser supervision and more stable training.
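The core idea can be sketched as a distillation loss: the policy's next-token distribution conditioned on the rich feedback acts as the teacher, and the same policy without the feedback in context is the student. The following is a minimal NumPy illustration of that objective, not SDPO's actual implementation; the function name and array shapes are assumptions for exposition.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student), averaged over token positions.

    teacher_logits: next-token logits produced with the rich feedback
    in context (the model teaching itself); student_logits: logits for
    the same target tokens without the feedback. Shape: (seq_len, vocab).
    """
    p = softmax(teacher_logits)                    # feedback-informed teacher
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits) + 1e-12)
    return float((p * (log_p - log_q)).sum(axis=-1).mean())

# Toy check: identical predictions incur (near-)zero loss,
# so gradient pressure only arises where feedback changes the prediction.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
assert self_distillation_loss(logits, logits) < 1e-9
```

Minimizing this KL pushes the policy's unconditioned predictions toward what it would have predicted had it already seen the feedback, which is what yields the denser supervision described above.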

Quick Start & Requirements

Requires Linux, NVIDIA GPUs (CUDA compatible), and Python 3.12. Installation is recommended via Docker for HPC environments, with specific build instructions provided for NVIDIA GH200 clusters. Local installation involves setting up PyTorch (e.g., CUDA 12.4 compatible), installing core dependencies via requirements.txt, and then installing SDPO in editable mode (pip install -e .). Flash Attention 2 compilation is also required. Official documentation links are available for the paper, code, and Weights & Biases logs.
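The local installation steps described above might look like the following; these commands are illustrative, assuming a CUDA 12.4 environment, and the exact pins and order should be taken from the project's README.

```shell
# Illustrative only — consult the SDPO README for the authoritative commands.
# PyTorch build matching CUDA 12.4 (index URL is the standard PyTorch wheel index):
pip install torch --index-url https://download.pytorch.org/whl/cu124
# Core dependencies, then SDPO itself in editable mode:
pip install -r requirements.txt
pip install -e .
# Flash Attention 2 is compiled against the local toolchain:
pip install flash-attn --no-build-isolation
```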

Highlighted Details

  • Novel self-distillation framework converts tokenized feedback into dense learning signals without external reward models.
  • Enables test-time self-distillation for iterative output refinement at inference, boosting performance on hard reasoning tasks without retraining.
  • Achieves faster convergence and stable training by incorporating rich environment feedback (e.g., runtime errors) directly into the learning process.
  • Outperforms baseline RL methods such as GRPO on reasoning benchmarks, particularly when rich feedback is available.

Maintenance & Community

The project is based on the verl framework. Notable authors are listed in the citation. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The license type is not specified in the README, posing a significant adoption blocker. Support for the Blackwell architecture (RTX 50 series, B100/B200) is explicitly marked as not fully tested.

Limitations & Caveats

Blackwell GPU architecture support is experimental and not fully validated. The absence of a clearly stated license in the README is a critical omission for evaluating commercial or broader adoption potential.

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 6
Issues (30d): 17
Star History: 492 stars in the last 30 days

Explore Similar Projects

Starred by Vincent Weisser (Cofounder of Prime Intellect), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 7 more.

self-rewarding-lm-pytorch by lucidrains

0%
1k
Training framework for self-rewarding language models
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems") and Edward Sun (Research Scientist at Meta Superintelligence Lab).

Eureka by eureka-research

0.0%
3k
LLM-based reward design for reinforcement learning
Created 2 years ago
Updated 1 year ago