OPSD  by siyan-zhao

On-policy self-distillation for large language models

Created 2 months ago
257 stars

Top 98.3% on SourcePulse

Project Summary

On-Policy Self-Distillation (OPSD) trains a large language model (LLM) to reason better by having a single model act as both student and teacher. It is aimed at researchers and practitioners who want to improve LLM performance on complex tasks by distilling knowledge from the model's own reasoning process, with no external teacher model required. The main benefit is improved performance from a cheap, self-supervised training signal.

How It Works

OPSD runs one model under two contexts: as the student it sees only the problem, while as the teacher it additionally sees the ground-truth solution. The student generates its own trajectories (hence "on-policy"), and training matches the student's token-level distributions to the teacher's along those trajectories. Because the teacher is the same model given extra context, no separate teacher model is needed.
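
The objective described above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the toy three-token vocabulary, the function names `softmax`, `token_kl`, and `opsd_loss`, and the choice of KL(teacher || student) as the divergence are all assumptions. It shows per-position distribution matching between a teacher pass (problem plus solution in context) and a student pass (problem only), both scored on the same student-sampled tokens.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def opsd_loss(student_logits, teacher_logits):
    """Mean per-position KL(teacher || student) along one student-sampled trajectory.

    Each argument holds one logit vector per generated token. In a real setup
    both would come from the same model run on two prompts: problem only
    (student) and problem plus ground-truth solution (teacher).
    The KL direction is an assumption made for this sketch.
    """
    kls = [
        token_kl(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(kls) / len(kls)

# Toy trajectory: two positions over a three-token vocabulary (made-up numbers).
student_logits = [[2.0, 0.5, 0.1], [0.3, 0.2, 1.5]]
teacher_logits = [[2.0, 0.5, 0.1], [1.8, 0.2, 0.4]]
loss = opsd_loss(student_logits, teacher_logits)
```

In this toy example the first position contributes nothing because both passes agree, so the loss comes entirely from the second position, where the solution-conditioned pass prefers a different token.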

Quick Start & Requirements

To install, create a Conda environment from environment.yml, activate it, and then run pip install flash-attn==2.8.3 --no-build-isolation. Users may need to choose a different flash-attn version to match their CUDA and PyTorch setup. The implementation builds on trl's experimental GOLD trainer.

Example launch scripts for OPSD, SFT, and GRPO training, as well as evaluation scripts, are provided in the scripts/ and eval/ directories. Training is fast: OPSD on Qwen3-1.7B reportedly takes about 15 minutes on 4xH100 GPUs, with performance peaking within 100 steps.

Highlighted Details

  • Reports 57.2% Avg@12 on AIME24 for Qwen3-1.7B after 100 training steps.
  • Introduces a novel training stabilization strategy: per-token point-wise KL clipping, which mitigates instability caused by high KL divergence in style tokens.
  • Supports both "thinking" (reasoning-enabled) and "non-thinking" modes, with the latter offering faster evaluation.
  • Offers a --fixed_teacher option, utilizing PEFT/LoRA, to prevent the teacher policy from updating, potentially improving stability.
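
The KL clipping idea from the list above can be illustrated as follows. This is a sketch under stated assumptions, not the repository's code: it reads "point-wise" as capping each vocabulary entry's contribution p_i * log(p_i / q_i) before summing, and the names `pointwise_kl_terms`, `clipped_token_kl`, and `clip_max` are hypothetical. The actual implementation may clip at a different granularity or with a different rule.

```python
import math

def pointwise_kl_terms(p, q):
    """Per-vocabulary-entry contributions p_i * log(p_i / q_i) to KL(p || q)."""
    return [pi * math.log(pi / qi) if pi > 0.0 else 0.0 for pi, qi in zip(p, q)]

def clipped_token_kl(p, q, clip_max):
    """Sum the point-wise terms after capping each at clip_max (hypothetical parameter)."""
    return sum(min(term, clip_max) for term in pointwise_kl_terms(p, q))

# Made-up distributions at one token position:
# the teacher is confident in token 0, the student in token 2.
teacher = [0.90, 0.05, 0.05]
student = [0.01, 0.04, 0.95]

unclipped = clipped_token_kl(teacher, student, clip_max=float("inf"))
clipped = clipped_token_kl(teacher, student, clip_max=0.5)
```

Here a single dominant term accounts for most of the unclipped divergence, and the cap bounds how much any one entry can drive the update. Capping only from above leaves negative terms untouched, so the clipped value is no longer guaranteed to be non-negative.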

Maintenance & Community

The project saw its initial code release on March 3, 2026, with updated code and experimental results released on March 18, 2026. The updates included bug fixes and the addition of the KL clipping stabilization strategy. Acknowledgements mention contributors who identified critical bugs. No explicit community channels (e.g., Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The provided README does not specify a software license. This lack of explicit licensing information presents a significant adoption blocker, as it leaves the terms of use, modification, and distribution unclear, particularly for commercial applications.

Limitations & Caveats

The repository's description field is empty ("None"). The --use_tinker_loss option is experimental, may be unstable, and does not implement KL clipping. The --fixed_teacher option requires --use_peft; training without PEFT may be unstable. The effectiveness and stability of the KL clipping strategy may also vary on models and datasets other than those reported.
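
The flag constraints above can be captured in a small pre-launch check. This is a hypothetical sketch: only the flag names --use_peft, --fixed_teacher, and --use_tinker_loss come from the text, while the parser, the `validate` helper, and the assumption that these are plain on/off switches are illustrative and do not reflect how the repository parses its arguments.

```python
import argparse

def build_parser():
    """Hypothetical parser that mirrors the three flags named in the text."""
    parser = argparse.ArgumentParser(description="Illustrative OPSD-style flag check")
    parser.add_argument("--use_peft", action="store_true")
    parser.add_argument("--fixed_teacher", action="store_true")
    parser.add_argument("--use_tinker_loss", action="store_true")
    return parser

def validate(args):
    """Return a list of human-readable problems with the flag combination."""
    problems = []
    if args.fixed_teacher and not args.use_peft:
        problems.append("--fixed_teacher requires --use_peft")
    if args.use_tinker_loss:
        problems.append("--use_tinker_loss is experimental and has no KL clipping")
    return problems

# A valid combination and an invalid one.
ok = validate(build_parser().parse_args(["--use_peft", "--fixed_teacher"]))
bad = validate(build_parser().parse_args(["--fixed_teacher"]))
```

Running a check like this before launching a job surfaces an unsupported combination immediately instead of partway through training.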

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 4
  • Star history: 127 stars in the last 30 days
