Improving SFT generalization with reward rectification
This repository introduces Dynamic Fine-Tuning (DFT), a method to improve the generalization of Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). It addresses the limitations of standard SFT by proposing a theoretically motivated reward rectification technique, offering a simpler yet effective alternative to reinforcement learning for certain tasks. The target audience includes LLM researchers and practitioners seeking to enhance SFT performance.
How It Works
DFT modifies the SFT objective by dynamically rescaling each token's loss by the probability the model assigns to that token. This "reward rectification" stabilizes gradient updates by correcting the problematic implicit reward structure that can hinder generalization in standard SFT. The change amounts to a single line of code, making it easy to integrate into existing SFT pipelines.
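For concreteness, below is a minimal PyTorch sketch of this reweighting, not the repository's own implementation; the function name dft_loss and the tensor shapes are illustrative, and it assumes the logits are already aligned with their target labels. The detached target-token probability is the extra factor that distinguishes DFT from plain cross-entropy SFT.

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, labels, ignore_index=-100):
    """DFT-style loss: per-token cross-entropy rescaled by the model's
    (detached) probability of the target token. Illustrative sketch."""
    vocab_size = logits.size(-1)
    # Standard per-token negative log-likelihood, kept unreduced.
    nll = F.cross_entropy(
        logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=ignore_index,
        reduction="none",
    )
    # Probability of the target token; detached so it acts as a fixed
    # per-token weight rather than adding extra gradient terms.
    p_target = torch.exp(-nll).detach()
    # Mask out padded / ignored positions before averaging.
    mask = (labels.view(-1) != ignore_index).float()
    return (p_target * nll * mask).sum() / mask.sum().clamp(min=1.0)
```

Dropping the p_target factor recovers the standard SFT cross-entropy loss, which is why the method can be applied to an existing pipeline as a one-line modification of the token-level loss.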
Quick Start & Requirements
The project requires conda and pip. The setup involves creating a Conda environment and installing vllm, sglang, and mcore. Training is launched with torchrun, and evaluation follows; links to the Qwen2.5-Math repository are provided for the evaluation setup.
Highlighted Details
DFT is integrated into ms-swift and has community reproductions.
Maintenance & Community
The project is associated with the volcengine/verl framework. Community reproductions and integration with ms-swift suggest active interest.
Licensing & Compatibility
The repository does not explicitly state a license. The associated volcengine/verl repository is Apache 2.0 licensed, but this specific project's licensing requires clarification for commercial use.
Limitations & Caveats
DFT performs best on tasks with non-deterministic solution trajectories (e.g., mathematical CoT, complex coding). Its performance is weaker on tasks with single, near-deterministic ground-truth answers and constrained CoT.