OPSD by siyan-zhao

On-policy self-distillation for large language models

Created 3 months ago

461 stars

Top 64.9% on SourcePulse

Project Summary

On-Policy Self-Distillation (OPSD) is a method for improving the reasoning ability of large language models (LLMs) in which a single model acts as both student and teacher. It targets researchers and practitioners who want better LLM performance on complex tasks, and it distills knowledge from the model's own reasoning process without requiring an external teacher model. The primary benefit is improved reasoning performance from self-supervision alone, with no separate teacher model to maintain.

How It Works

OPSD trains the model to match token-level distributions produced by the same model under two different contexts: the student sees only the problem, while the teacher additionally sees the ground-truth solution. The student first generates its own trajectory, and its distribution is then matched to the teacher's at each token along that trajectory. Because one model plays both roles, no separate teacher model is needed; the teacher's only advantage is its access to the solution.
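The token-level matching described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the repository's implementation: the function names are hypothetical, the logits are made-up toy values, and forward KL (teacher to student) is used only as one possible divergence, since this summary does not say which divergence OPSD uses.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

def self_distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student) along one student-sampled trajectory.

    student_logits[t]: logits at position t given only the problem (hypothetical input).
    teacher_logits[t]: logits at the same position from the same model,
                       given the problem plus the ground-truth solution.
    The divergence direction is an assumption made for illustration.
    """
    assert len(student_logits) == len(teacher_logits)
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)

# Toy trajectory: 2 generated tokens, vocabulary of 3 (values are made up).
student_logits = [[2.0, 0.5, 0.1], [0.3, 0.2, 1.5]]
teacher_logits = [[0.5, 2.5, 0.1], [0.3, 0.2, 3.0]]
loss = self_distillation_loss(student_logits, teacher_logits)
print(f"toy self-distillation loss: {loss:.4f}")
```

In a real training loop the two sets of logits would come from two forward passes of the same network over the student's sampled tokens, differing only in the prompt, and the loss would be backpropagated through the student pass.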

Quick Start & Requirements

Installation involves creating a Conda environment from environment.yml, activating it, and then running pip install flash-attn==2.8.3 --no-build-isolation. Users may need to choose a different flash-attn version to match their CUDA and PyTorch setup. The implementation builds on trl's experimental GOLD trainer. Example launch scripts for OPSD, SFT, and GRPO training, as well as evaluation scripts, are provided in the scripts/ and eval/ directories. Training is fast: OPSD on Qwen3-1.7B reportedly takes about 15 minutes on 4xH100 GPUs, with performance peaking within 100 steps.

Highlighted Details

Achieves significant performance improvements, e.g., reaching 57.2% Avg@12 on AIME24 for Qwen3-1.7B after 100 steps.
Introduces a novel training stabilization strategy: per-token point-wise KL clipping, which mitigates instability caused by high KL divergence in style tokens.
Supports both "thinking" (reasoning-enabled) and "non-thinking" modes, with the latter offering faster evaluation.
Offers a --fixed_teacher option, utilizing PEFT/LoRA, to prevent the teacher policy from updating, potentially improving stability.
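One plausible reading of per-token point-wise KL clipping is sketched below. It is a sketch under assumptions rather than the repository's code: the function names and the clip_max threshold are hypothetical, and the exact form of clipping OPSD applies is not specified in this summary. The idea illustrated is that each vocabulary entry's contribution to a token's KL is capped before summing, so one badly mismatched entry (for example, on a stylistic token) cannot dominate the loss.

```python
import math

def pointwise_kl_terms(p_teacher, p_student):
    """Per-vocabulary-entry contributions p * (log p - log q) to KL(teacher || student)."""
    return [
        p * (math.log(p) - math.log(q)) if p > 0 else 0.0
        for p, q in zip(p_teacher, p_student)
    ]

def clipped_token_kl(p_teacher, p_student, clip_max):
    """Sum of point-wise KL terms, each capped at clip_max (hypothetical threshold)."""
    return sum(min(term, clip_max) for term in pointwise_kl_terms(p_teacher, p_student))

# Toy distributions over a 3-entry vocabulary (made-up values): teacher and
# student put almost all mass on different entries, as might happen when
# they disagree only about phrasing.
p_teacher = [0.98, 0.01, 0.01]
p_student = [0.01, 0.98, 0.01]

unclipped = clipped_token_kl(p_teacher, p_student, clip_max=float("inf"))
clipped = clipped_token_kl(p_teacher, p_student, clip_max=1.0)
print(f"unclipped: {unclipped:.3f}  clipped: {clipped:.3f}")
```

Capping each term bounds the size of the loss contribution any single token can make, which is one way such clipping could reduce instability.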

Maintenance & Community

The project saw its initial code release on March 3, 2026, with updated code and experimental results released on March 18, 2026. The updates included bug fixes and the addition of the KL clipping stabilization strategy. Acknowledgements mention contributors who identified critical bugs. No explicit community channels (e.g., Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The provided README does not specify a software license. This lack of explicit licensing information presents a significant adoption blocker, as it leaves the terms of use, modification, and distribution unclear, particularly for commercial applications.

Limitations & Caveats

The repository lists no project description ("None"). The --use_tinker_loss option is experimental, may be unstable, and does not implement clipping. Using a fixed teacher requires --use_peft; disabling PEFT may lead to training instability. The effectiveness and stability of the KL clipping strategy may vary on models and datasets other than those reported.
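The toy sketch below illustrates why a fixed teacher pairs naturally with LoRA-style adapters. It is an assumption about the mechanism, not a description of the repository's code: the LoRALinear class and its use_adapter argument are hypothetical, and the weights are made-up values. The point shown is that when only the adapter matrices are trained, the frozen base weights still give the original policy, which can act as a teacher that never changes.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class LoRALinear:
    """Toy linear layer: frozen base weight W plus a trainable low-rank update B @ A."""

    def __init__(self, W, A, B):
        self.W = W  # frozen base weight, shape d_out x d_in
        self.A = A  # trainable, shape r x d_in
        self.B = B  # trainable, shape d_out x r

    def forward(self, x, use_adapter=True):
        base = matvec(self.W, x)
        if not use_adapter:
            # Hypothetical "teacher" path: base weights only.
            return base
        delta = matvec(self.B, matvec(self.A, x))
        return [b + d for b, d in zip(base, delta)]

layer = LoRALinear(
    W=[[1.0, 0.0], [0.0, 1.0]],
    A=[[1.0, 1.0]],
    B=[[0.5], [0.0]],
)
x = [2.0, 3.0]

teacher_before = layer.forward(x, use_adapter=False)
student_before = layer.forward(x, use_adapter=True)

# Simulate a training step that changes only the adapter weights.
layer.B = [[1.0], [0.0]]

teacher_after = layer.forward(x, use_adapter=False)
student_after = layer.forward(x, use_adapter=True)
print(teacher_before, teacher_after, student_before, student_after)
```

Without adapters, every update would change the only copy of the weights, so keeping the teacher fixed would require holding a second frozen copy of the model.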
