OPD by thunlp

Improving LLM on-policy distillation through mechanism analysis

Created 3 months ago

782 stars

Top 44.0% on SourcePulse

Summary

This project investigates why on-policy distillation (OPD) of large language models succeeds or fails, dynamics that remain poorly understood. It systematically studies OPD phenomenology and mechanisms and derives practical recipes from them. For researchers and practitioners, the work offers actionable strategies to recover failing distillation runs and a deeper understanding of token-level alignment.

How It Works

The project identifies two conditions for OPD to succeed: the student and teacher must have compatible "thinking patterns", and the teacher must provide genuinely novel capabilities. Mechanistically, successful OPD shows progressive alignment on high-probability tokens at student-visited states, where 97%-99% of the probability mass is concentrated. Two practical strategies, "off-policy cold start" and "teacher-aligned prompt selection", are proposed to recover failing OPD. Together, the mechanistic analysis and the practical recipe are the project's main contribution.
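To make the idea of alignment on high-probability tokens concrete, here is a minimal sketch that measures how much probability mass a student and a teacher place on their shared top-k tokens at one position. The function names, the choice of k, and the toy logits are illustrative assumptions; this is not the project's analysis code.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_ids(probs, k):
    """Return the set of token ids with the k highest probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])


def shared_top_k_mass(student_probs, teacher_probs, k):
    """Probability mass each model assigns to the tokens that appear
    in BOTH models' top-k sets (hypothetical metric for illustration)."""
    shared = top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)
    student_mass = sum(student_probs[i] for i in shared)
    teacher_mass = sum(teacher_probs[i] for i in shared)
    return shared, student_mass, teacher_mass


# Toy next-token logits over a 5-token vocabulary (made-up numbers).
student = softmax([5.0, 4.0, 0.0, -1.0, -2.0])
teacher = softmax([4.5, 5.0, -1.0, 0.5, -2.0])

shared, s_mass, t_mass = shared_top_k_mass(student, teacher, k=2)
print(sorted(shared), round(s_mass, 3), round(t_mass, 3))
```

In this toy case both models agree on the same two tokens and put nearly all of their mass on them, which is the kind of concentration the summary describes. Tracking such a quantity over training steps would be one way to observe progressive alignment.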

Quick Start & Requirements

Setup requires two separate environments: one for OPD/RL (verl v0.7.0, Python 3.12, vllm, sglang, mcore) and one for SFT training (LlamaFactory v0.9.5, Python 3.11). Key commands include bash scripts/install_vllm_sglang_mcore.sh and pip install -e . (an editable install). Experiments used 8 x NVIDIA A800 80GB GPUs, which indicates substantial hardware requirements. Links to the paper (arXiv:2604.13016) and to various setup and inference scripts are provided.

Highlighted Details

Identifies critical OPD success factors: compatible thinking patterns and teacher's novel capabilities.
Characterizes successful OPD via token-level alignment on high-probability tokens (97%-99%).
Proposes practical recovery strategies: "off-policy cold start" and "teacher-aligned prompt selection."
Investigates scalability limits of dense token-level rewards for long-horizon distillation.
Releases SFT checkpoint Qwen3-1.7B-SFT and RL checkpoint Qwen3-4B-Base-GRPO.
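The dense token-level reward mentioned in the list above can be illustrated with a small sketch. It assumes a common OPD formulation in which each token sampled by the student is scored by the teacher's log-probability minus the student's log-probability, a single-sample estimate of the negative reverse KL. The summary does not state the repository's exact objective, so the function names and toy logits here are hypothetical.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw logits (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_rewards(student_logits, teacher_logits, sampled_ids):
    """One reward per position: log p_teacher(token) - log p_student(token).

    The tokens are assumed to be sampled from the student (on-policy),
    so averaging these rewards estimates -KL(student || teacher).
    """
    rewards = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_ids):
        s_logp = log_softmax(s_row)[tok]
        t_logp = log_softmax(t_row)[tok]
        rewards.append(t_logp - s_logp)
    return rewards


# Two positions, 3-token vocabulary (made-up logits).
student_logits = [[2.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
teacher_logits = [[2.0, 0.0, -1.0], [3.0, 0.0, 0.0]]
sampled_ids = [0, 1]  # tokens the student actually generated

rewards = per_token_rewards(student_logits, teacher_logits, sampled_ids)
print(rewards)
```

At the first position the two models agree, so the reward is zero. At the second, the student sampled a token the teacher considers unlikely, so the reward is negative. Every position receives its own signal, which is what makes the reward dense, and the number of such signals grows with sequence length.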

Maintenance & Community

Recent activity (April 2026 news) indicates ongoing development. Primary contacts are Bingxiang He (hebx24@mails.tsinghua.edu.cn) and Ning Ding (dingning@mail.tsinghua.edu.cn). No community channels or roadmap links are provided.

Licensing & Compatibility

The license type is not specified in the README, requiring clarification for adoption decisions, particularly concerning commercial use or closed-source integration.

Limitations & Caveats

The project itself raises concerns about whether dense token-level rewards scale to long-horizon distillation. The reported experiments used multiple high-end GPUs, which implies significant hardware requirements. Setup involves managing two complex, separate environments (verl and LlamaFactory). The absence of a specified license is a further caveat.

Explore Similar Projects

awesome-on-policy-distillation by chrisliu298

G-OPD by RUCBM

AwesomeOPD by thinkwee

distillm by jongwooko

distill-sd by segmind

OPSD by siyan-zhao

Awesome-Knowledge-Distillation-of-LLMs by Tebmer

Self-Distillation by idanshen

distilling-step-by-step by google-research

SDPO by lasgroup

DistillKit by arcee-ai

mdistiller by megvii-research