Light-R1 by Qihoo360

Math model research paper using curriculum SFT, DPO, and RL

created 5 months ago
734 stars

Top 48.1% on sourcepulse

Project Summary

This project provides a framework and pre-trained models for enhancing Large Language Models (LLMs) with advanced reasoning capabilities, specifically targeting complex mathematical problem-solving. It is designed for researchers and developers aiming to build state-of-the-art models for specialized domains such as competitive mathematics, offering a practical and cost-effective route to long Chain-of-Thought (CoT) reasoning.

How It Works

Light-R1 employs a multi-stage post-training methodology: curriculum Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO). This approach leverages decontaminated mathematical datasets and distills knowledge from existing strong models such as DeepSeek-R1. The curriculum moves through progressively harder datasets during SFT, after which DPO aligns model behavior with the desired reasoning patterns. Model merging is then used to combine strengths from different training stages, as sketched below.
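As a rough illustration of the merging step, one common approach is weighted parameter averaging of two checkpoints. The snippet below is a minimal sketch, assuming two Hugging Face-format checkpoints and an illustrative interpolation weight; it is not the project's exact merge recipe.

    # Minimal sketch of model merging via weighted parameter averaging.
    # Checkpoint paths and the interpolation weight are illustrative only.
    import torch
    from transformers import AutoModelForCausalLM

    sft_model = AutoModelForCausalLM.from_pretrained("path/to/curriculum-sft", torch_dtype=torch.bfloat16)
    dpo_model = AutoModelForCausalLM.from_pretrained("path/to/dpo", torch_dtype=torch.bfloat16)

    alpha = 0.5  # interpolation weight between the two stages (illustrative value)
    dpo_state = dpo_model.state_dict()
    merged_state = {
        name: alpha * param + (1 - alpha) * dpo_state[name]
        for name, param in sft_model.state_dict().items()
    }

    sft_model.load_state_dict(merged_state)
    sft_model.save_pretrained("path/to/merged-model")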

Quick Start & Requirements

  • Install/Run: Training scripts are provided and built on 360-LLaMA-Factory. Inference is suggested via vLLM or SGLang (see the sketch after this list).
  • Prerequisites: Requires substantial compute (e.g., 12 x H800 machines for ~6 hours). Python dependencies are managed by 360-LLaMA-Factory.
  • Resources: Estimated training cost around $1000.
  • Links: Paper, W&B, HF Collections, HF Datasets.
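For inference, a minimal vLLM sketch could look like the following. The Hugging Face repo id and sampling parameters are assumptions for illustration and should be checked against the project's HF collection.

    # Minimal vLLM inference sketch; repo id and sampling values are assumptions.
    from vllm import LLM, SamplingParams

    llm = LLM(model="qihoo360/Light-R1-32B")  # assumed Hugging Face repo id
    params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=8192)

    prompt = "Solve: what is the sum of the first 100 positive integers?"
    outputs = llm.generate([prompt], params)
    print(outputs[0].outputs[0].text)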

Highlighted Details

  • Achieves state-of-the-art results on AIME24/25 and GPQA benchmarks for 7B, 14B, and 32B models.
  • Provides all training datasets and code for curriculum SFT and DPO.
  • Demonstrates effective model merging techniques for performance enhancement.
  • Includes detailed evaluation logs and methodology, averaging scores over 64 runs (see the averaging sketch after this list).
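The 64-run averaging amounts to scoring the benchmark once per independent sampling run and taking the mean. The sketch below uses hypothetical evaluator names purely to show the averaging; it is not the repo's actual evaluation harness.

    # Illustrative scoring sketch: average accuracy over 64 independent runs.
    # evaluate_once and model.solve are hypothetical placeholders, not the repo's API.
    import statistics

    def evaluate_once(model, problems):
        # Fraction of problems answered correctly in a single sampled run.
        return sum(model.solve(p) == p.answer for p in problems) / len(problems)

    def evaluate_avg(model, problems, runs=64):
        # Mean accuracy across independent runs, as in the reported methodology.
        return statistics.mean(evaluate_once(model, problems) for _ in range(runs))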

Maintenance & Community

The project is associated with Qihoo 360. Further community interaction details are not explicitly provided in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The models are specialized for mathematical reasoning and may exhibit forgetting on general tasks. Inference requires specific handling of special tokens (<think>) to elicit reasoning behavior. The README notes potential score deviations if fewer than 64 runs are averaged for evaluation.
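One way to handle the special token is to make sure the generation prompt ends with <think> so the model starts in its reasoning mode. The sketch below is a hedged illustration assuming the released tokenizer's chat template and an assumed repo id; the template may already insert the token.

    # Hedged sketch: ensure the prompt ends with <think> so the model begins
    # with its reasoning trace. Repo id is an assumption; verify the released
    # chat template before relying on this.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("qihoo360/Light-R1-32B")  # assumed repo id
    messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    if not prompt.rstrip().endswith("<think>"):
        prompt += "<think>\n"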

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 59 stars in the last 90 days
