Awesome-LLM-On-Policy-Distillation by nick7nlp

On-Policy Distillation for Large Language Models

Created 3 months ago

461 stars

Top 64.9% on SourcePulse

Project Summary

This repository is a curated collection of resources on On-Policy Distillation (OPD) for Large Language Models (LLMs). It focuses on the problem of compounding errors in LLM reasoning and gives researchers, engineers, and power users a structured path to understanding, implementing, and advancing this post-training paradigm. Its main benefit is a centralized, up-to-date knowledge base for work on more robust and capable LLMs.

How It Works

On-Policy Distillation (OPD) tackles the exposure bias inherent in traditional off-policy methods like Supervised Fine-Tuning (SFT): a student trained only on static teacher demonstrations never sees the states its own errors lead to at inference time. In OPD, the student model instead generates its own trajectories. A teacher model, reward model, or verifier then evaluates those trajectories and supplies a dense, token-level supervision signal. Because training happens on the student's own generative distribution, the student learns from its own mistakes, which the repository presents as important for scaling LLMs with complex reasoning capabilities.
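As a rough illustration of the loop described above, the following toy sketch samples a trajectory from a student and scores every position against a teacher. It is not taken from the repository or any cited paper: the three-token vocabulary, the `student_probs` and `teacher_probs` functions, and the choice of per-token reverse KL as the supervision signal are all assumptions made for illustration. A real setup would use neural models and backpropagate the loss into the student.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]  # hypothetical toy vocabulary


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def student_probs(prefix):
    # Hypothetical student policy: prefers "a", ends more often as the prefix grows.
    return softmax([1.0, 0.5, 0.1 * len(prefix)])


def teacher_probs(prefix):
    # Hypothetical teacher policy: prefers "b" in the same states.
    return softmax([0.2, 1.5, 0.3 * len(prefix)])


def sample_trajectory(policy, max_len, rng):
    """On-policy step: the student generates its own token sequence."""
    prefix = []
    for _ in range(max_len):
        token = rng.choices(VOCAB, weights=policy(prefix), k=1)[0]
        prefix.append(token)
        if token == "<eos>":
            break
    return prefix


def reverse_kl(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def token_level_losses(trajectory):
    """One loss per position, each evaluated on a prefix the student itself produced."""
    losses = []
    for t in range(len(trajectory)):
        prefix = trajectory[:t]  # state the student was in before emitting token t
        losses.append(reverse_kl(student_probs(prefix), teacher_probs(prefix)))
    return losses


rng = random.Random(0)
trajectory = sample_trajectory(student_probs, max_len=8, rng=rng)
losses = token_level_losses(trajectory)
print(trajectory)
print([round(x, 4) for x in losses])
```

Each entry of `losses` corresponds to one position of the student's own sample, which is the sense in which the signal is dense. An off-policy SFT loss would instead be computed on prefixes written by the teacher.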

Quick Start & Requirements

This repository is a curated resource hub, not a runnable codebase. The primary starting points are the comprehensive survey paper available on arXiv and the companion OPDHub website, which offers full-text search and multi-axis filtering of OPD methods. Recommended reading orders and specific paper suggestions are provided for users with different backgrounds or focusing on particular tasks like math reasoning or agent development.

Highlighted Details

OPDHub: A dedicated companion site featuring full-text search and advanced filtering capabilities for all indexed OPD resources.
Comprehensive Survey: An evolving survey paper (V3 released) detailing OPD taxonomy, method selection, landscape, and theoretical frameworks.
Model Atlas: Maps teacher-student model pairings across 170 papers, revealing ecosystem trends and dominant models like the Qwen family.
Industrial Adoption: Highlights the integration of OPD into frontier models such as DeepSeek-V4, Qwen3, Gemma-2, Nemotron, and MiMo.
Key Trends: Identifies shifts towards adaptive objectives, the rise of self-distillation, token importance weighting, agentic OPD, and concerns around diversity collapse.

Maintenance & Community

The repository is actively maintained, with recent updates noted in June 2026. Contributions are welcomed via Pull Requests and Issues, fostering a collaborative environment for expanding and refining the collection.

Licensing & Compatibility

The repository itself does not specify a license. Users should refer to the individual papers and resources linked within for their respective licensing terms and compatibility constraints, particularly concerning commercial use.

Limitations & Caveats

As a curated list of research papers and resources, this repository does not provide code for implementing OPD; users must consult the cited papers for implementation details. The field is evolving rapidly, so the collection requires continuous updates to stay current. Working with OPD methods also calls for a strong background in machine learning and LLM training.

Explore Similar Projects

DLLM-Survey by LiQiiiii

D-OPSD by vvvvvjdy

AwesomeOPD by thinkwee

awesome-on-policy-distillation by chrisliu298

llm-continual-learning-survey by Wang-ML-Lab

DiffuLLaMA by HKUNLP

OPSD by siyan-zhao

HugNLP by HugAILab

Self-Distillation by idanshen

SDPO by lasgroup

tinker-cookbook by thinking-machines-lab

Awesome-Incremental-Learning by xialeiliu