yongliang-wu: Improving SFT generalization with reward rectification
Top 62.0% on SourcePulse
This repository introduces Dynamic Fine-Tuning (DFT), a method to improve the generalization of Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). It addresses the limitations of standard SFT by proposing a theoretically motivated reward rectification technique, offering a simpler yet effective alternative to reinforcement learning for certain tasks. The target audience includes LLM researchers and practitioners seeking to enhance SFT performance.
How It Works
DFT modifies the SFT objective by dynamically rescaling each token's loss by its predicted probability. This "reward rectification" stabilizes gradient updates and corrects the problematic implicit reward structure that can hinder generalization in standard SFT. The approach amounts to a single-line code change, making it easy to integrate into existing SFT pipelines.
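As a concrete illustration, below is a minimal PyTorch sketch of this per-token rescaling, assuming logits of shape (batch, seq_len, vocab) and labels already aligned to them; the function name dft_loss and the masking convention are illustrative and not necessarily the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def dft_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Sketch of a DFT-style loss: per-token cross-entropy weighted by the
    detached predicted probability of the target token.

    logits: (batch, seq_len, vocab), labels: (batch, seq_len)
    """
    # Per-token negative log-likelihood of the target token.
    ce = F.cross_entropy(
        logits.transpose(1, 2),  # (batch, vocab, seq_len) as expected by cross_entropy
        labels,
        ignore_index=ignore_index,
        reduction="none",
    )  # (batch, seq_len)

    # Predicted probability of the target token, detached so the rescaling
    # acts as a fixed per-token weight rather than a gradient path.
    prob = torch.exp(-ce).detach()

    # Mask out padding / ignored positions before averaging.
    mask = (labels != ignore_index).float()

    # The "single-line change" relative to standard SFT: weight each token's
    # cross-entropy by its detached probability.
    return (prob * ce * mask).sum() / mask.sum().clamp(min=1)
```

In an existing SFT loop, the equivalent change is simply to multiply the per-token cross-entropy by the detached probability of the target token before averaging, leaving the rest of the pipeline untouched.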
Quick Start & Requirements
Requires conda and pip. Setup involves creating a Conda environment and installing vllm, sglang, and mcore. Training is launched with torchrun, and links to the Qwen2.5-Math repository are provided for evaluation setup.
Highlighted Details
The method is integrated into ms-swift and has community reproductions.
Maintenance & Community
The project is associated with the volcengine/verl framework. Community reproductions and integration with ms-swift suggest active interest.
Licensing & Compatibility
The repository does not explicitly state a license. The associated volcengine/verl repository is Apache 2.0 licensed, but this specific project's licensing requires clarification for commercial use.
Limitations & Caveats
DFT performs best on tasks with non-deterministic solution trajectories (e.g., mathematical CoT, complex coding). Its performance is weaker on tasks with single, near-deterministic ground-truth answers and constrained CoT.