siyan-zhao
On-policy self-distillation for large language models
Top 98.3% on SourcePulse
On-Policy Self-Distillation (OPSD) addresses the challenge of training large language models (LLMs) to improve reasoning capabilities by enabling a single model to act as both student and teacher. This approach is designed for researchers and practitioners aiming to enhance LLM performance on complex tasks, offering a method to distill knowledge from a model's own reasoning process without requiring external teacher models. The primary benefit is improved performance through efficient self-supervision.
How It Works
OPSD employs on-policy self-distillation, where the model is trained to match token-level distributions. The core mechanism involves conditioning the model on two contexts: the student sees only the problem, while the teacher additionally sees the ground-truth solution. By performing token-level distribution matching along the student's own generated trajectories, the model learns to improve its reasoning. This approach avoids the need for separate teacher models and leverages the model's internal states for distillation.
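The token-level matching described above can be illustrated with a small standard-library sketch. It is not the repository's implementation: the function names, the toy logits, and the choice of KL(teacher || student) as the divergence are assumptions made for illustration, since the summary does not state which divergence or direction OPSD uses. Both sets of logits are imagined to come from the same model, scored at each position of a trajectory the student sampled, with the teacher pass additionally conditioned on the ground-truth solution.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position (assumed direction)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def self_distillation_loss(teacher_logits_seq, student_logits_seq):
    """Mean per-token KL along one student-generated trajectory."""
    if len(teacher_logits_seq) != len(student_logits_seq):
        raise ValueError("teacher and student must score the same trajectory")
    per_token = [
        token_kl(t, s) for t, s in zip(teacher_logits_seq, student_logits_seq)
    ]
    return sum(per_token) / len(per_token)

# Toy data: a 3-token trajectory over a 4-entry vocabulary.
# student_logits: model conditioned on the problem only.
# teacher_logits: same model, also conditioned on the reference solution.
student_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, -2.0, 0.3]]
teacher_logits = [[2.0, 0.5, 0.1, -1.0], [3.0, 0.0, 0.0, 0.0], [1.0, 2.0, -2.0, 0.3]]

loss = self_distillation_loss(teacher_logits, student_logits)
```

In this toy case the first position contributes zero loss because the two distributions agree, while the other two positions pull the student toward the teacher's distribution. In practice the teacher-side logits would normally be treated as a fixed target, so that gradients flow only through the student pass.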
Quick Start & Requirements
Installation involves creating a Conda environment from environment.yml and activating it, followed by pip install flash-attn==2.8.3 --no-build-isolation. Users may need to select a flash-attn version compatible with their CUDA and PyTorch setup. The implementation builds upon trl's experimental GOLD trainer. Example launch scripts for OPSD, SFT, and GRPO training, as well as evaluation scripts, are provided in the scripts/ and eval/ directories. Training is notably fast, with OPSD on Qwen3-1.7B reportedly taking ~15 minutes on 4xH100 GPUs, peaking within 100 steps.
Highlighted Details
A --fixed_teacher option, utilizing PEFT/LoRA, prevents the teacher policy from updating, potentially improving stability.
Maintenance & Community
The project saw its initial code release on March 3, 2026, with updated code and experimental results released on March 18, 2026. The updates included bug fixes and the addition of the KL clipping stabilization strategy. Acknowledgements mention contributors who identified critical bugs. No explicit community channels (e.g., Discord, Slack) or roadmap links are provided in the README.
Licensing & Compatibility
The provided README does not specify a software license. This lack of explicit licensing information presents a significant adoption blocker, as it leaves the terms of use, modification, and distribution unclear, particularly for commercial applications.
Limitations & Caveats
The repository's description field is empty ("None"). The --use_tinker_loss option is experimental, potentially unstable, and lacks a clipping implementation. Using a fixed teacher requires --use_peft; disabling PEFT may lead to training instability. The effectiveness and stability of the KL clipping strategy may vary across models and datasets beyond those presented.
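The flag constraints above can be expressed as a small pre-launch check. This is a hypothetical sketch, not the repository's argument parser: only the flag names --use_peft, --fixed_teacher, and --use_tinker_loss come from the README summary, while the parser, the validate helper, and the messages are assumptions for illustration.

```python
import argparse

def build_parser():
    """Hypothetical parser exposing only the three flags named in the summary."""
    parser = argparse.ArgumentParser(description="Illustrative OPSD flag check")
    parser.add_argument("--use_peft", action="store_true")
    parser.add_argument("--fixed_teacher", action="store_true")
    parser.add_argument("--use_tinker_loss", action="store_true")
    return parser

def validate(args):
    """Reject unsupported flag combinations and collect advisory notes."""
    if args.fixed_teacher and not args.use_peft:
        raise ValueError("--fixed_teacher requires --use_peft")
    notes = []
    if args.use_tinker_loss:
        notes.append("--use_tinker_loss is experimental and has no clipping")
    return notes

# A supported combination: a fixed teacher together with PEFT.
args = build_parser().parse_args(["--use_peft", "--fixed_teacher"])
notes = validate(args)
```

Failing fast on an unsupported combination is cheaper than discovering it partway through a training run, though the actual scripts may enforce these constraints differently or not at all.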