lasgroup: Self-distillation framework for reinforcement learning with rich feedback
Top 63.5% on SourcePulse
SDPO addresses the credit-assignment bottleneck in reinforcement learning for verifiable domains by leveraging rich textual feedback. Targeting researchers and practitioners working with LLMs, it enables denser learning signals and improved performance by converting feedback into a self-distillation process, reducing reliance on sparse scalar rewards.
How It Works
SDPO formalizes Reinforcement Learning with Rich Feedback (RLRF). Its core innovation is self-distillation, where the model acts as its own teacher. It distills feedback-informed next-token predictions back into the policy, enabling the model to retrospectively identify and correct mistakes. This approach can also reuse high-reward trajectories as implicit feedback when rich environment data is scarce, leading to denser supervision and more stable training.
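The self-distillation idea described above can be illustrated with a toy token-level loss. This is a minimal sketch under stated assumptions, not the project's implementation: it assumes the "teacher" is the same model scored with the feedback in its context, the "student" is the policy without that feedback, and the two are compared with a per-token KL divergence. The function names, the KL direction, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def self_distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) over one sequence.

    Assumption: teacher_logits come from the model conditioned on feedback,
    student_logits from the same model without it. Each argument is a list
    of per-position logit vectors.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy example: a 3-token vocabulary and a 2-token response.
# The feedback-conditioned teacher disagrees with the policy at position 0 only.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.1, 3.0]]
student = [[0.5, 2.0, -1.0], [0.1, 0.1, 3.0]]
loss = self_distillation_loss(teacher, student)
print(f"distillation loss: {loss:.4f}")
```

In this sketch every token position contributes its own loss term, which is what makes the signal denser than a single scalar reward for the whole response. Only the position where the two distributions differ contributes a nonzero term.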
Quick Start & Requirements
Requires Linux, CUDA-capable NVIDIA GPUs, and Python 3.12. Docker is the recommended installation route for HPC environments, with specific build instructions provided for NVIDIA GH200 clusters. A local installation involves setting up PyTorch (e.g., a CUDA 12.4-compatible build), installing the core dependencies from requirements.txt, and then installing SDPO in editable mode (pip install -e .). Flash Attention 2 must also be compiled. Links to the paper, the code, and the Weights & Biases logs are provided.
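Before starting a local installation, a short pre-flight script can confirm that the stated requirements are met. The following is a hypothetical helper, not part of SDPO. It assumes PyTorch is importable as torch and Flash Attention as flash_attn, and it checks only for presence, not for specific package versions.

```python
import importlib.util
import platform
import sys


def preflight_report():
    """Return a dict of requirement name -> bool for the current environment."""
    report = {
        "linux": platform.system() == "Linux",
        "python_3_12": sys.version_info[:2] == (3, 12),
        # find_spec returns None when a top-level package is not installed.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }

    # Only query CUDA if PyTorch is actually present.
    cuda_ok = False
    if report["torch_installed"]:
        import torch

        cuda_ok = bool(torch.cuda.is_available())
    report["cuda_available"] = cuda_ok
    return report


if __name__ == "__main__":
    for name, ok in preflight_report().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Running the script prints one line per requirement, so a missing prerequisite shows up before the longer build steps begin.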
Maintenance & Community
The project is based on the verl framework. Notable authors are listed in the citation. No specific community channels (e.g., Discord, Slack) or roadmap links are provided in the README.
Licensing & Compatibility
The README does not specify a license, which is a significant blocker to adoption. Support for the Blackwell architecture (RTX 50 series, B100/B200) is explicitly marked as not fully tested.
Limitations & Caveats
Blackwell GPU support is experimental and not fully validated. Because the README states no license, the project is hard to evaluate for commercial or broader adoption.