Math reasoning models and research using curriculum SFT, DPO, and RL
This project provides a framework and pre-trained models for enhancing Large Language Models (LLMs) with advanced reasoning capabilities, specifically targeting complex mathematical problem-solving. It is designed for researchers and developers aiming to build state-of-the-art models for specialized domains like competitive mathematics, offering a practical and cost-effective approach to achieving long Chain-of-Thought (CoT) reasoning.
How It Works
Light-R1 employs a multi-stage post-training methodology: curriculum Supervised Fine-Tuning (SFT), followed by Direct Preference Optimization (DPO), with reinforcement learning (RL) applied in later stages. This approach leverages decontaminated mathematical datasets and distills knowledge from existing strong models such as DeepSeek-R1. The curriculum uses progressively harder datasets for SFT, after which DPO aligns model behavior with the desired reasoning patterns. Model merging is also used to combine strengths from different training stages.
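A minimal sketch of the curriculum SFT followed by DPO recipe described above, written against Hugging Face TRL for illustration. The base model id, dataset names, and hyperparameters are placeholder assumptions; Light-R1's released training code may use a different stack entirely.

```python
# Sketch only: curriculum SFT (easy -> hard) followed by DPO.
# Dataset names and the base checkpoint below are hypothetical placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

base = "Qwen/Qwen2.5-32B-Instruct"  # assumed base model for this sketch
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Curriculum SFT: train on an easier split first, then a harder,
# decontaminated split, carrying the weights forward between stages.
for stage_data in ["math-sft-stage1", "math-sft-stage2"]:  # placeholder datasets
    train_set = load_dataset(stage_data, split="train")
    trainer = SFTTrainer(
        model=model,
        train_dataset=train_set,
        args=SFTConfig(output_dir=f"ckpt-{stage_data}", num_train_epochs=1),
    )
    trainer.train()
    model = trainer.model  # feed the fine-tuned weights into the next stage

# DPO: preference pairs where "chosen" is a correct long-CoT solution and
# "rejected" is a flawed or truncated one, nudging the model toward the
# desired reasoning style.
pref_set = load_dataset("math-dpo-pairs", split="train")  # placeholder dataset
dpo_trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="ckpt-dpo", beta=0.1, num_train_epochs=1),
    train_dataset=pref_set,
    processing_class=tokenizer,
)
dpo_trainer.train()
```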
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project is associated with Qihoo 360. Further community interaction details are not explicitly provided in the README.
Licensing & Compatibility
Limitations & Caveats
The models are specialized for mathematical reasoning and may exhibit forgetting on general tasks. Inference requires specific handling of special tokens (e.g., the <think> token) to elicit reasoning behavior. The README notes potential score deviations if fewer than 64 evaluation runs are averaged.
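A minimal sketch of eliciting the reasoning behavior at inference time by appending the <think> token to the formatted prompt, as the caveat above describes. The Hugging Face model id and generation settings are illustrative assumptions, not the project's documented defaults.

```python
# Sketch only: force the long chain-of-thought mode by appending <think>.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qihoo360/Light-R1-32B"  # assumed Hugging Face repo id for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Without the <think> prefix, the specialized model may skip its reasoning trace.
prompt += "<think>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```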