thunlp
Improving LLM on-policy distillation through mechanism analysis
Top 64.8% on SourcePulse
Summary
This project addresses the poorly understood dynamics and failure modes of on-policy distillation (OPD) for large language models. It systematically investigates OPD phenomenology and mechanisms, and derives practical recipes aimed at researchers and practitioners. The work provides actionable strategies for recovering failing distillation runs and deepens the understanding of token-level alignment.
How It Works
The project systematically investigates OPD dynamics and identifies two conditions for success: the student and teacher must have compatible "thinking patterns", and the teacher must provide genuinely novel capabilities. Successful OPD is characterized by progressive alignment on high-probability tokens (97%-99% concentration) within student-visited states. To recover failing OPD, the project proposes practical strategies such as "off-policy cold start" and "teacher-aligned prompt selection". The combination of mechanistic analysis and a practical recipe is presented as the project's novel contribution.
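To make the idea of alignment on high-probability tokens concrete, the following is a minimal sketch and not the project's implementation. It assumes that student and teacher next-token distributions are available as logits at a single student-visited state. It measures how much probability mass each model places on the tokens that both rank in their top k, and it computes a reverse KL divergence as one possible alignment measure. The function names, the toy logits, and the choice of reverse KL are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_ids(probs, k):
    """Return the set of token ids with the k largest probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])


def shared_top_k_mass(student_probs, teacher_probs, k):
    """Mass that student and teacher each place on tokens in BOTH top-k sets."""
    shared = top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)
    student_mass = sum(student_probs[i] for i in shared)
    teacher_mass = sum(teacher_probs[i] for i in shared)
    return student_mass, teacher_mass


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one state; eps guards against log(0)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


if __name__ == "__main__":
    # Toy logits over a 5-token vocabulary (hypothetical values).
    student = softmax([4.0, 2.0, 0.0, -1.0, -2.0])
    teacher = softmax([3.5, 2.5, -1.0, 0.0, -2.0])
    print("shared top-2 mass:", shared_top_k_mass(student, teacher, k=2))
    print("reverse KL:", reverse_kl(student, teacher))
```

In this toy case both models agree on their two most likely tokens, so most of the probability mass falls on the shared set even though the tail rankings differ.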
Quick Start & Requirements
Setup requires two separate environments: one for OPD/RL (verl v0.7.0, Python 3.12, vllm, sglang, mcore) and one for SFT training (LlamaFactory v0.9.5, Python 3.11). Key commands include bash scripts/install_vllm_sglang_mcore.sh and pip install -e . (the trailing dot is part of the command). Experiments used 8 x NVIDIA A800 80GB GPUs, so hardware requirements are substantial. Links to the paper (arXiv:2604.13016) and to various setup and inference scripts are provided.
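Because the two workflows pin different Python versions, it can help to confirm which one the active interpreter matches before installing anything. The sketch below is an illustration only: the environment labels opd_rl and sft are hypothetical names, and the version numbers are taken from the requirements listed above.

```python
import sys

# Python versions from the requirements above; the keys are hypothetical labels.
REQUIRED_PYTHON = {
    "opd_rl": (3, 12),  # verl v0.7.0 with vllm, sglang, mcore
    "sft": (3, 11),     # LlamaFactory v0.9.5
}


def matching_environments(version_info=sys.version_info):
    """Return the labels whose required (major, minor) matches the interpreter."""
    current = (version_info[0], version_info[1])
    return [name for name, required in REQUIRED_PYTHON.items() if current == required]


if __name__ == "__main__":
    matches = matching_environments()
    if matches:
        print("This interpreter matches:", ", ".join(matches))
    else:
        print("This interpreter matches neither environment:", sys.version.split()[0])
```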
Highlighted Details
Qwen3-1.7B-SFT and RL checkpoint Qwen3-4B-Base-GRPO.
Maintenance & Community
Recent activity (April 2026 news) indicates ongoing development. Primary contacts are Bingxiang He (hebx24@mails.tsinghua.edu.cn) and Ning Ding (dingning@mail.tsinghua.edu.cn). No community channels or roadmap links are provided.
Licensing & Compatibility
The license type is not specified in the README, requiring clarification for adoption decisions, particularly concerning commercial use or closed-source integration.
Limitations & Caveats
Concerns are raised regarding the scalability of dense token-level rewards for long-horizon distillation. Significant hardware resources (multiple high-end GPUs) are implied. Setup involves managing complex, distinct environments (verl, LlamaFactory). The absence of a specified license is a notable caveat.
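The scalability concern is easier to see with a small sketch of what a dense token-level reward can look like. One common formulation in on-policy distillation scores every student-sampled token by the gap between the teacher's and the student's log-probability for that token. Whether this project uses exactly this form is an assumption, and the function name and toy probabilities below are hypothetical.

```python
import math


def dense_token_rewards(student_logprobs, teacher_logprobs):
    """One reward per sampled token: teacher log-prob minus student log-prob."""
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]


if __name__ == "__main__":
    # Log-probs each model assigns to the tokens the student sampled (toy values).
    student = [math.log(0.9), math.log(0.5), math.log(0.2)]
    teacher = [math.log(0.9), math.log(0.25), math.log(0.4)]
    rewards = dense_token_rewards(student, teacher)
    # The number of reward terms equals the rollout length.
    print(len(rewards), rewards)
```

Every generated token requires a teacher evaluation and contributes its own reward term, so the amount of supervision to compute grows with rollout length.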