Math reasoning models and research using curriculum SFT, DPO, and RL
This project provides a framework and pre-trained models for enhancing Large Language Models (LLMs) with advanced reasoning capabilities, specifically targeting complex mathematical problem-solving. It is designed for researchers and developers aiming to build state-of-the-art models for specialized domains like competitive mathematics, offering a practical and cost-effective approach to achieving long Chain-of-Thought (CoT) reasoning.
How It Works
Light-R1 employs a multi-stage post-training methodology: curriculum Supervised Fine-Tuning (SFT), followed by Direct Preference Optimization (DPO), with reinforcement learning (RL) applied in later stages. This approach leverages decontaminated mathematical datasets and distills knowledge from existing strong models such as DeepSeek-R1. The curriculum uses progressively harder datasets for SFT, after which DPO aligns model behavior with the desired reasoning patterns. Model merging is also used to combine strengths from different training stages.
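A minimal sketch of the curriculum SFT followed by DPO recipe described above, written against Hugging Face TRL for illustration. The base model id, dataset names, and hyperparameters are placeholder assumptions; Light-R1's released training code may use a different stack entirely.

```python
# Sketch only: curriculum SFT (easy -> hard) followed by DPO.
# Dataset names and the base checkpoint below are hypothetical placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

base = "Qwen/Qwen2.5-32B-Instruct"  # assumed base model for this sketch
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Curriculum SFT: train on an easier split first, then a harder,
# decontaminated split, carrying the weights forward between stages.
for stage_data in ["math-sft-stage1", "math-sft-stage2"]:  # placeholder datasets
    train_set = load_dataset(stage_data, split="train")
    trainer = SFTTrainer(
        model=model,
        train_dataset=train_set,
        args=SFTConfig(output_dir=f"ckpt-{stage_data}", num_train_epochs=1),
    )
    trainer.train()
    model = trainer.model  # feed the fine-tuned weights into the next stage

# DPO: preference pairs where "chosen" is a correct long-CoT solution and
# "rejected" is a flawed or truncated one, nudging the model toward the
# desired reasoning style.
pref_set = load_dataset("math-dpo-pairs", split="train")  # placeholder dataset
dpo_trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="ckpt-dpo", beta=0.1, num_train_epochs=1),
    train_dataset=pref_set,
    processing_class=tokenizer,
)
dpo_trainer.train()
```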
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project is associated with Qihoo 360. Further community interaction details are not explicitly provided in the README.
Licensing & Compatibility
Limitations & Caveats
The models are specialized for mathematical reasoning and may exhibit forgetting on general tasks. Inference requires specific handling of special tokens (e.g., the <think> token) to elicit reasoning behavior. The README notes potential score deviations if fewer than 64 evaluation runs are averaged.
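A minimal sketch of eliciting the reasoning behavior at inference time by appending the <think> token to the formatted prompt, as the caveat above describes. The Hugging Face model id and generation settings are illustrative assumptions, not the project's documented defaults.

```python
# Sketch only: force the long chain-of-thought mode by appending <think>.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qihoo360/Light-R1-32B"  # assumed Hugging Face repo id for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Without the <think> prefix, the specialized model may skip its reasoning trace.
prompt += "<think>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```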