Step-DPO: LLM training method for long-chain reasoning (research paper)
This repository implements Step-DPO, a method for enhancing the long-chain reasoning capabilities of Large Language Models (LLMs). It targets researchers and developers aiming to improve LLM performance on complex tasks like mathematical reasoning, offering a data-efficient approach with a provided dataset construction pipeline.
How It Works
Step-DPO employs a step-wise preference optimization strategy. It refines LLMs by training on preference pairs that highlight correct reasoning steps, rather than just final answers. This approach is data-efficient, achieving significant performance gains with a relatively small dataset (10K pairs) and a limited number of training steps.
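The objective mirrors DPO but is applied to individual reasoning steps: given the prompt and the shared correct prefix, the model is rewarded for preferring the corrected step over the erroneous one. The sketch below is illustrative only, not the repository's implementation; it assumes per-step log-probabilities have already been gathered from the policy and a frozen reference model, and the function name, arguments, and beta value are assumptions.

import torch
import torch.nn.functional as F

def step_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log-prob of the correct step given prompt + preceding steps (policy)
    policy_rejected_logps: torch.Tensor,  # log-prob of the erroneous step given the same prefix (policy)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # assumed temperature; not taken from the repository
) -> torch.Tensor:
    # DPO-style preference objective applied per reasoning step: maximize the
    # implicit reward margin between the correct and erroneous next step.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()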
Quick Start & Requirements
Set up the environment and install dependencies:

conda create -n step_dpo python=3.10
conda activate step_dpo
pip install -r requirements.txt

Training and evaluation are driven by accelerate and deepspeed. Training requires significant GPU resources (e.g., 8x A100s for 72B models). The preference data is the xinlai/Math-Step-DPO-10K dataset from Hugging Face.
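The preference data can be pulled directly with the Hugging Face datasets library. The snippet below only loads and inspects the dataset, since its column names are not documented in this summary; the "train" split name is assumed.

from datasets import load_dataset

# Load the 10K step-wise preference pairs referenced above and inspect the schema.
ds = load_dataset("xinlai/Math-Step-DPO-10K", split="train")
print(ds.column_names)  # discover the prompt / chosen-step / rejected-step fields
print(ds[0])            # look at a single preference example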
Maintenance & Community
The project is actively maintained, with recent updates including the release of data construction scripts and a model demo. It builds on established projects such as alignment-handbook.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README does not detail specific limitations, but the training requirements for larger models are substantial. The data construction pipeline relies on GPT-4o, which may incur API costs.