OPD by thunlp

Improving LLM on-policy distillation through mechanism analysis

Created 1 month ago
464 stars

Top 64.8% on SourcePulse

Project Summary

On-policy distillation (OPD) of large language models often fails, and its dynamics are poorly understood. This project systematically investigates OPD phenomenology and mechanisms, and derives practical recipes for making it succeed. For researchers and practitioners, it offers actionable strategies to recover failing distillation runs and a clearer picture of token-level alignment.

How It Works

The project studies OPD dynamics and identifies two conditions for success: the student and teacher must have compatible "thinking patterns", and the teacher must provide genuinely novel capabilities. Successful OPD is characterized by progressive alignment on high-probability tokens (97%-99% concentration) at states the student itself visits. To recover failing OPD, the authors propose two practical strategies: "off-policy cold start" and "teacher-aligned prompt selection". The contribution is the combination of this mechanistic account with a practical recipe.
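To make "alignment at student-visited states" concrete, here is a minimal sketch of a token-level objective. It assumes a per-token reverse KL divergence, KL(student || teacher), averaged over positions in a sequence the student sampled. That is a common formulation of on-policy distillation, but the summary does not confirm it is the loss this project uses. The function names are illustrative and not taken from the repository.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position (assumed objective)."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))

def opd_loss(student_logits_per_step, teacher_logits_per_step):
    """Mean per-token reverse KL over positions of a student-sampled sequence.

    Both arguments hold one logit list per position. The teacher logits are
    assumed to be computed on the same student-generated prefix.
    """
    kls = [reverse_kl(s, t)
           for s, t in zip(student_logits_per_step, teacher_logits_per_step)]
    return sum(kls) / len(kls)

# Toy example: two positions, vocabulary of three tokens.
student = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
teacher = [[2.2, 0.4, 0.0], [1.5, 0.2, 0.3]]
print(opd_loss(student, teacher))
```

Because the expectation is taken under the student's distribution, the loss is dominated by tokens the student already assigns high probability, which is consistent with the high-probability-token concentration described above.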

Quick Start & Requirements

Setup requires two separate environments: one for OPD/RL (verl v0.7.0, Python 3.12, vllm, sglang, mcore) and one for SFT training (LlamaFactory v0.9.5, Python 3.11). Key commands are bash scripts/install_vllm_sglang_mcore.sh and pip install -e . (an editable install of the current directory). The experiments used 8 x NVIDIA A800 80GB GPUs, so hardware requirements are substantial. Links to the paper (arXiv:2604.13016) and to various setup and inference scripts are provided.
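Since the two stacks need different interpreters, a small guard can catch a script started in the wrong environment. The sketch below is an illustration only: the stack labels "opd_rl" and "sft" and the helper check_python are hypothetical and not part of the repository. The version numbers come from the requirements above.

```python
import sys

# Python versions listed in the requirements above.
# The stack labels are hypothetical names chosen for this example.
EXPECTED_PYTHON = {"opd_rl": (3, 12), "sft": (3, 11)}

def check_python(stack, version_info=None):
    """Return True if the interpreter's major.minor matches the given stack."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == EXPECTED_PYTHON[stack]

if not check_python("opd_rl"):
    print("Warning: the OPD/RL stack expects Python 3.12")
```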

Highlighted Details

  • Identifies critical OPD success factors: compatible thinking patterns and teacher's novel capabilities.
  • Characterizes successful OPD via token-level alignment on high-probability tokens (97%-99%).
  • Proposes practical recovery strategies: "off-policy cold start" and "teacher-aligned prompt selection."
  • Investigates scalability limits of dense token-level rewards for long-horizon distillation.
  • Releases SFT checkpoint Qwen3-1.7B-SFT and RL checkpoint Qwen3-4B-Base-GRPO.
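The 97%-99% figure refers to probability mass concentrating on high-probability tokens. The summary does not define the metric, so the sketch below shows one plausible reading: at a given position, take the tokens that appear in both the student's and the teacher's top-k sets and measure how much of each distribution's mass they carry. The helper names and the choice of k are assumptions made for this example.

```python
def top_k_ids(probs, k):
    """Indices of the k highest-probability tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def shared_top_k_mass(student_probs, teacher_probs, k):
    """Mass each model assigns to tokens present in both top-k sets.

    Returns (student_mass, teacher_mass). Values near 1.0 for both would
    mean the two models agree on where almost all probability sits.
    """
    shared = top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)
    student_mass = sum(student_probs[i] for i in shared)
    teacher_mass = sum(teacher_probs[i] for i in shared)
    return student_mass, teacher_mass

# Toy distributions over a four-token vocabulary.
student = [0.6, 0.3, 0.05, 0.05]
teacher = [0.5, 0.1, 0.35, 0.05]
print(shared_top_k_mass(student, teacher, k=2))
```

Tracking this quantity over training steps would be one way to observe progressive alignment, under the assumptions stated above.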

Maintenance & Community

Recent activity (April 2026 news) indicates ongoing development. Primary contacts are Bingxiang He (hebx24@mails.tsinghua.edu.cn) and Ning Ding (dingning@mail.tsinghua.edu.cn). No community channels or roadmap links are provided.

Licensing & Compatibility

The license type is not specified in the README, requiring clarification for adoption decisions, particularly concerning commercial use or closed-source integration.

Limitations & Caveats

The project itself raises concerns about how well dense token-level rewards scale to long-horizon distillation. The reported experiments used 8 x NVIDIA A800 80GB GPUs, so significant hardware resources are needed. Setup involves managing two distinct environments (verl and LlamaFactory). No license is specified, which is a notable caveat.

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 6
Star History: 321 stars in the last 30 days
