Awesome-LLM-On-Policy-Distillation  by nick7nlp

On-Policy Distillation for Large Language Models

Created 2 months ago
285 stars

Top 91.8% on SourcePulse

Project Summary

This repository is a curated collection of resources on On-Policy Distillation (OPD) for Large Language Models (LLMs). The paradigm it covers addresses compounding errors in LLM reasoning, and the collection gives researchers, engineers, and power users a structured path to understanding, implementing, and extending this post-training approach. Its main benefit is a single, up-to-date knowledge base for anyone working on more robust and capable LLMs.

How It Works

On-Policy Distillation (OPD) tackles the exposure bias inherent in traditional off-policy methods like Supervised Fine-Tuning (SFT). Instead of learning from static teacher demonstrations, the student model generates its own trajectories. A teacher model, reward model, or verifier then evaluates these self-generated trajectories, providing a dense, token-level supervision signal. The student therefore learns from its own mistakes within its own generative distribution, which the repository presents as essential for scaling LLMs with complex reasoning capabilities.
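The loop described above can be illustrated with a toy sketch. It is not taken from the repository, which contains no code. The three-token vocabulary, the `student_probs` and `teacher_probs` functions, and the choice of per-token reverse KL as the supervision signal are all assumptions made for illustration; the methods collected in the repository use a range of objectives and real language models.

```python
import math
import random

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary


def student_probs(prefix):
    # Hypothetical toy student: the same distribution for every prefix.
    return [0.4, 0.3, 0.3]


def teacher_probs(prefix):
    # Hypothetical toy teacher: prefers "b" right after "a", otherwise prefers "a".
    if prefix and prefix[-1] == "a":
        return [0.1, 0.8, 0.1]
    return [0.7, 0.2, 0.1]


def sample_trajectory(policy, length, rng):
    # On-policy step: the trajectory is sampled from the student itself,
    # not copied from a fixed teacher demonstration.
    prefix = []
    for _ in range(length):
        token = rng.choices(VOCAB, weights=policy(prefix), k=1)[0]
        prefix.append(token)
    return prefix


def reverse_kl(p, q):
    # KL(p || q) over the vocabulary; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def per_token_losses(trajectory, student, teacher):
    # Dense supervision: one loss value per position, each computed on the
    # prefix that the student actually produced.
    losses = []
    for t in range(len(trajectory)):
        prefix = trajectory[:t]
        losses.append(reverse_kl(student(prefix), teacher(prefix)))
    return losses


def opd_step(student, teacher, length=5, seed=0):
    # One illustrative step: sample from the student, then score with the teacher.
    rng = random.Random(seed)
    trajectory = sample_trajectory(student, length, rng)
    return trajectory, per_token_losses(trajectory, student, teacher)


if __name__ == "__main__":
    trajectory, losses = opd_step(student_probs, teacher_probs)
    for token, loss in zip(trajectory, losses):
        print(token, round(loss, 4))
```

In a real setup the per-token losses would be backpropagated through the student's parameters. The sketch stops at computing the signal, since the point is only that the teacher scores prefixes the student sampled itself.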

Quick Start & Requirements

This repository is a curated resource hub, not a runnable codebase. The primary starting points are the comprehensive survey paper available on arXiv and the companion OPDHub website, which offers full-text search and multi-axis filtering of OPD methods. Recommended reading orders and specific paper suggestions are provided for users with different backgrounds or focusing on particular tasks like math reasoning or agent development.

Highlighted Details

  • OPDHub: A dedicated companion site featuring full-text search and advanced filtering capabilities for all indexed OPD resources.
  • Comprehensive Survey: An evolving survey paper (V3 released) detailing OPD taxonomy, method selection, landscape, and theoretical frameworks.
  • Model Atlas: Maps teacher-student model pairings across 170 papers, revealing ecosystem trends and dominant models like the Qwen family.
  • Industrial Adoption: Highlights the integration of OPD into frontier models such as DeepSeek-V4, Qwen3, Gemma-2, Nemotron, and MiMo.
  • Key Trends: Identifies shifts towards adaptive objectives, the rise of self-distillation, token importance weighting, agentic OPD, and concerns around diversity collapse.

Maintenance & Community

The repository is actively maintained, with recent updates noted in June 2026. Contributions are welcomed via Pull Requests and Issues, fostering a collaborative environment for expanding and refining the collection.

Licensing & Compatibility

The repository itself does not specify a license. Users should refer to the individual papers and resources linked within for their respective licensing terms and compatibility constraints, particularly concerning commercial use.

Limitations & Caveats

As a curated list of research papers and resources, this repository does not provide direct code for implementing OPD. Users must consult the cited papers for implementation details. The field of OPD is rapidly evolving, requiring continuous updates to the collection. The complexity of OPD methods necessitates a strong background in machine learning and LLM training.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 269 stars in the last 30 days
