Step-DPO by dvlab-research

Step-DPO: LLM training method for long-chain reasoning (research paper)

created 1 year ago
375 stars

Top 76.8% on sourcepulse

View on GitHub
Project Summary

This repository implements Step-DPO, a method for enhancing the long-chain reasoning capabilities of Large Language Models (LLMs). It targets researchers and developers aiming to improve LLM performance on complex tasks like mathematical reasoning, offering a data-efficient approach with a provided dataset construction pipeline.

How It Works

Step-DPO employs a step-wise preference optimization strategy: instead of treating a complete answer as the unit of preference, it optimizes over individual reasoning steps. Each training pair shares a prompt and a correct partial solution, then contrasts a correct next step against an erroneous one, so the model learns to localize and avoid the first wrong step rather than merely prefer better final answers. The approach is data-efficient, achieving significant performance gains with a relatively small dataset (10K preference pairs) and a limited number of training steps.
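Below is a minimal sketch of a step-level DPO loss consistent with this description, assuming per-step log-probabilities (summed over the tokens of a single reasoning step, conditioned on the prompt plus the shared correct preceding steps) have already been computed under the policy and a frozen reference model. The function name, tensor names, and beta value are illustrative, not the repo's API.

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(policy_chosen_logps: torch.Tensor,
                  policy_rejected_logps: torch.Tensor,
                  ref_chosen_logps: torch.Tensor,
                  ref_rejected_logps: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    # Each input is the log-probability of ONE reasoning step (summed over
    # its tokens), given the prompt and the shared correct prefix of steps.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Standard DPO sigmoid loss, applied at step granularity: push the
    # policy to prefer the correct step over the erroneous one.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```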

Quick Start & Requirements

  • Install: conda create -n step_dpo python=3.10; conda activate step_dpo; pip install -r requirements.txt.
  • Prerequisites: Python 3.10, accelerate, deepspeed. Training requires significant GPU resources (e.g., 8x A100s for 72B models).
  • Data: Download the xinlai/Math-Step-DPO-10K dataset from Hugging Face (a loading sketch follows this list).
  • Models: Pre-trained weights for Qwen2, Qwen1.5, Llama-3, and DeepSeekMath are available.
  • Docs: Hugging Face Dataset, Demo.
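
A minimal loading sketch for the dataset above, using the Hugging Face datasets library (the train split name is an assumption):

```python
from datasets import load_dataset

# Math-Step-DPO-10K: ~10K step-wise preference pairs (split name assumed).
ds = load_dataset("xinlai/Math-Step-DPO-10K", split="train")
print(ds.column_names)  # inspect the available fields
print(ds[0])            # look at one preference pair
```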

Highlighted Details

  • Achieves state-of-the-art results, surpassing closed-source models like GPT-4-1106 on MATH and GSM8K benchmarks.
  • Demonstrates significant performance gains on Qwen2-7B-Instruct (5.6% on MATH, 2.4% on GSM8K) with only 10K preference pairs.
  • Provides a data construction pipeline to generate custom step-wise preference datasets (an illustrative record shape follows this list).
  • Offers pre-trained models fine-tuned with Step-DPO, available on Hugging Face.
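
To make the step-wise format concrete, a record in such a dataset might look like the following; the field names and example problem are purely illustrative and not guaranteed to match the released data.

```python
# Hypothetical shape of one step-wise preference pair: chosen and rejected
# share the prompt and the correct preceding steps, and differ only at the
# first erroneous reasoning step.
pair = {
    "prompt": "Natalia sold clips to 48 friends in April ...",
    "initial_reason_steps": "Step 1: In April, Natalia sold 48 clips. ...",
    "chosen": "Step 2: In May, she sold 48 / 2 = 24 clips. ...",
    "rejected": "Step 2: In May, she sold 48 * 2 = 96 clips. ...",
}
```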

Maintenance & Community

The project is actively maintained, with recent updates including the release of data construction scripts and a model demo. It is based on established projects like alignment-handbook.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not detail specific limitations, but the training requirements for larger models are substantial. The data construction pipeline relies on GPT-4o, which may incur API costs.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

16 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 10 more.

open-r1 by huggingface

Top 0.2% on sourcepulse · 25k stars
SDK for reproducing DeepSeek-R1
created 6 months ago, updated 3 days ago