awesome-on-policy-distillation by chrisliu298

On-policy distillation techniques for LLM training and alignment

Created 4 months ago

547 stars

Top 57.6% on SourcePulse

1 Expert Loves This Project

Wing Lian

Founder of Axolotl AI

Project Summary

This repository curates resources on on-policy distillation (OPD), a technique for training large language models (LLMs) in which a student model learns from its own generated samples under the guidance of a teacher model. OPD addresses the train-inference distribution gap that affects off-policy distillation and supervised fine-tuning, where the student trains on sequences it did not generate itself. The list is aimed at researchers and engineers; OPD itself is a post-training primitive that major AI labs have adopted.

How It Works

OPD trains a student LLM on trajectories sampled from its own evolving policy, while a teacher model provides dense, token-level supervision on those trajectories. Because the training data comes from the student itself, the mismatch between training and inference distributions is smaller than in off-policy methods, which train on teacher-generated or fixed datasets. OPD can be viewed either as reinforcement learning with teacher-defined rewards or as Generalized Knowledge Distillation (GKD) applied to student rollouts.
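The loop described above can be sketched with toy models. This is a minimal illustration, not the method of any listed paper or framework: the two logit functions, the four-token vocabulary, and the choice of reverse KL as the per-token divergence are all assumptions made for the example. It samples a rollout from the student, then scores every position of that rollout against the teacher.

```python
import math
import random

# Hypothetical toy vocabulary; real models use tens of thousands of tokens.
VOCAB = ["a", "b", "c", "<eos>"]


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def student_logits(prefix):
    # Hypothetical student: prefers "a", ends more readily as the prefix grows.
    return [1.0 + 0.1 * len(prefix), 0.5, 0.0, -0.5 + 0.3 * len(prefix)]


def teacher_logits(prefix):
    # Hypothetical teacher: prefers "b" instead.
    return [0.2, 1.5, 0.0, -0.5 + 0.5 * len(prefix)]


def sample_rollout(policy, rng, max_len=8):
    """Sample token ids from `policy` until <eos> or max_len (the on-policy step)."""
    prefix = []
    for _ in range(max_len):
        probs = softmax(policy(prefix))
        tok = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        prefix.append(tok)
        if VOCAB[tok] == "<eos>":
            break
    return prefix


def reverse_kl(p_student, p_teacher):
    """KL(student || teacher) over the full vocabulary at one position."""
    return sum(
        ps * (math.log(ps) - math.log(pt))
        for ps, pt in zip(p_student, p_teacher)
        if ps > 0
    )


def per_token_losses(rollout):
    """Dense supervision: one divergence value per position of the student's rollout."""
    losses = []
    for t in range(len(rollout)):
        prefix = rollout[:t]  # context the student actually produced
        p_s = softmax(student_logits(prefix))
        p_t = softmax(teacher_logits(prefix))
        losses.append(reverse_kl(p_s, p_t))
    return losses


rng = random.Random(0)
rollout = sample_rollout(student_logits, rng)
losses = per_token_losses(rollout)
print([VOCAB[t] for t in rollout])
print([round(x, 4) for x in losses])
```

In a real setup the student's parameters would be updated to reduce the mean of these per-token losses, and a fresh rollout would be sampled from the updated student at each step. Under the same reverse-KL assumption, the reinforcement learning view treats log p_teacher(y_t) - log p_student(y_t) as a per-token reward for each sampled token y_t; its expectation over the student's own samples equals the negative of the reverse KL computed here.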

Quick Start & Requirements

This is a curated collection, not a single installable project. Users should consult the "Frameworks and Implementations" section for tools like TRL, NeMo-RL, and KDFlow. Specific requirements depend on the chosen framework; links to official documentation are provided.

Highlighted Details

OPD is a standard post-training primitive used by Alibaba (Qwen3), DeepSeek (V4), Xiaomi (MiMo), Zhipu (GLM-5), and NVIDIA (Nemotron-Cascade 2).
The collection covers numerous OPD variants: black-box, self-distillation, cross-tokenizer, efficiency optimizations, and applications in agents, multimodal models, and diffusion.
"Industrial recipes" detail production-level OPD pipelines.
Key implementation frameworks like TRL, NeMo-RL, and KDFlow are listed.

Maintenance & Community

The repository follows the "Awesome" list format: it curates research and resources, acknowledges parallel efforts, and provides contribution guidelines. No direct community channels (e.g., Discord, Slack) are listed.

Licensing & Compatibility

The collection does not specify a license of its own. Users should review the individual licenses of the linked papers and frameworks before commercial or closed-source use.

Limitations & Caveats

Because this is a curated list, users must select and integrate specific frameworks or papers themselves. The field is evolving rapidly: many 2026 papers address known failure modes such as instability and diversity collapse, so any chosen OPD technique should be evaluated carefully.

Health Check

Last Commit

5 days ago

Responsiveness

Inactive

Star History

130 stars in the last 30 days