awesome-on-policy-distillation  by chrisliu298

On-policy distillation techniques for LLM training and alignment

Created 2 months ago
310 stars

Top 86.7% on SourcePulse

Summary

This repository curates resources on On-Policy Distillation (OPD), a technique for training Large Language Models (LLMs) in which a student model learns from its own generated samples under the guidance of a teacher model. OPD addresses the train-inference distribution gap prevalent in off-policy distillation and supervised fine-tuning. The list is aimed at researchers and engineers and presents OPD as a post-training primitive adopted by major AI labs.

How It Works

OPD trains a student LLM on trajectories sampled from its own evolving policy, with a teacher model providing dense, token-level supervision. Because the student is trained on the same kind of sequences it produces at inference, the distribution mismatch between training and inference shrinks; off-policy methods, by contrast, train on fixed sequences that the student did not generate. OPD can be viewed either as reinforcement learning with teacher-defined rewards or as Generalized Knowledge Distillation (GKD) applied to student rollouts.
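
The loop described above can be sketched on a toy problem. The example below is an illustration only, not code from the repository or from any listed framework: the three-token vocabulary, the bigram-style student and teacher tables, the learning rate, and all function names are hypothetical. It assumes per-token reverse KL, KL(student || teacher), as the teacher signal, which is one possible choice of divergence, and it treats each sampled rollout as fixed when computing gradients.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]
START = len(VOCAB)  # context index used before any token has been generated

# Hypothetical teacher: fixed next-token logits for each context (previous token).
TEACHER_LOGITS = [
    [0.5, 2.0, 0.0],   # after "a"
    [2.0, 0.5, 1.0],   # after "b"
    [0.0, 0.0, 0.0],   # after "<eos>" (never visited, rollouts stop there)
    [1.5, 1.0, -1.0],  # at the start of a sequence
]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def context_of(prefix):
    """Toy models condition only on the previous token."""
    return prefix[-1] if prefix else START


def sample_rollout(student_logits, rng, max_len=6):
    """Sample a trajectory from the student's own current policy (on-policy data)."""
    prefix = []
    while len(prefix) < max_len:
        probs = softmax(student_logits[context_of(prefix)])
        token = rng.choices(range(len(VOCAB)), weights=probs)[0]
        prefix.append(token)
        if VOCAB[token] == "<eos>":
            break
    return prefix


def reverse_kl(p, q):
    """KL(p || q) for two distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def opd_step(student_logits, rollout, lr=0.5):
    """One gradient step on the mean per-token reverse KL along a student rollout."""
    grads = [[0.0] * len(VOCAB) for _ in student_logits]
    total = 0.0
    for t in range(len(rollout)):
        ctx = context_of(rollout[:t])
        p = softmax(student_logits[ctx])   # student distribution at this position
        q = softmax(TEACHER_LOGITS[ctx])   # teacher's dense, token-level target
        kl = reverse_kl(p, q)
        total += kl
        for k in range(len(VOCAB)):
            # d KL(p || q) / d logit_k for p = softmax(logits)
            grads[ctx][k] += p[k] * (math.log(p[k] / q[k]) - kl)
    n = len(rollout)
    for ctx in range(len(student_logits)):
        for k in range(len(VOCAB)):
            student_logits[ctx][k] -= lr * grads[ctx][k] / n
    return total / n


def train(steps=300, seed=0):
    rng = random.Random(seed)
    student = [[0.0] * len(VOCAB) for _ in range(len(VOCAB) + 1)]  # uniform start
    losses = []
    for _ in range(steps):
        rollout = sample_rollout(student, rng)      # student generates
        losses.append(opd_step(student, rollout))   # teacher scores every token
    return student, losses
```

Calling `train()` returns the learned student table and the per-step losses. Only contexts the student actually visits are updated, which is the sense in which the training distribution tracks the student's own inference-time behavior.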

Quick Start & Requirements

This is a curated collection, not a single installable project. Users should consult the "Frameworks and Implementations" section for tools like TRL, NeMo-RL, and KDFlow. Specific requirements depend on the chosen framework; links to official documentation are provided.

Highlighted Details

  • OPD is a standard post-training primitive used by Alibaba (Qwen3), DeepSeek (V4), Xiaomi (MiMo), Zhipu (GLM-5), and NVIDIA (Nemotron-Cascade 2).
  • The collection covers numerous OPD variants: black-box, self-distillation, cross-tokenizer, efficiency optimizations, and applications in agents, multimodal models, and diffusion.
  • "Industrial recipes" detail production-level OPD pipelines.
  • Key implementation frameworks like TRL, NeMo-RL, and KDFlow are listed.

Maintenance & Community

The repository is an "Awesome"-style list that acknowledges parallel curation efforts and provides contribution guidelines. No direct community channels (e.g., Discord, Slack) are listed.

Licensing & Compatibility

The collection itself lacks a specified license. Users must review the individual licenses of linked papers and frameworks for commercial use or closed-source compatibility.

Limitations & Caveats

As a curated list, the repository leaves it to users to select and integrate specific frameworks or papers. The field is evolving rapidly: many 2026 papers address known failure modes such as instability and diversity collapse, so any chosen OPD technique should be evaluated carefully.

Health Check

Last Commit: 1 day ago
Responsiveness: Inactive
Pull Requests (30d): 3
Issues (30d): 0
Star History: 262 stars in the last 30 days
