Step-DPO by dvlab-research

Step-DPO: LLM training method for long-chain reasoning (research paper)

Created 1 year ago
380 stars

Top 75.0% on SourcePulse

View on GitHub
Project Summary

This repository implements Step-DPO, a method for enhancing the long-chain reasoning capabilities of Large Language Models (LLMs). It targets researchers and developers aiming to improve LLM performance on complex tasks like mathematical reasoning, offering a data-efficient approach with a provided dataset construction pipeline.

How It Works

Step-DPO employs a step-wise preference optimization strategy. Rather than comparing only final answers, it trains on preference pairs in which a correct reasoning step and an erroneous one share the same problem and the same correct preceding steps, so the optimization signal localizes the first mistake in a long chain. The approach is data-efficient, achieving significant performance gains with a relatively small dataset (10K pairs) and a limited number of training steps.
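
As a rough illustration, the step-wise objective can be read as a standard DPO loss applied at the granularity of a single reasoning step: the chosen and rejected continuations share the problem statement and the correct preceding steps, and differ only in the next step. The PyTorch sketch below is a hypothetical rendering of that idea, not the repository's actual training code; all names are illustrative.

```python
import torch.nn.functional as F

def step_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style loss over a single reasoning step (illustrative sketch).

    Each argument is a tensor of summed token log-probabilities for the
    chosen (correct) or rejected (erroneous) step, conditioned on the
    problem plus the shared correct preceding steps.
    """
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer the correct step over the erroneous one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```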

Quick Start & Requirements

  • Install: conda create -n step_dpo python=3.10; conda activate step_dpo; pip install -r requirements.txt.
  • Prerequisites: Python 3.10, accelerate, deepspeed. Training requires significant GPU resources (e.g., 8x A100s for 72B models).
  • Data: Download the xinlai/Math-Step-DPO-10K dataset from Hugging Face (a loading sketch follows this list).
  • Models: Pre-trained weights for Qwen2, Qwen1.5, Llama-3, and DeepSeekMath are available.
  • Docs: Hugging Face Dataset, Demo.
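
For reference, the preference data can be pulled with the standard Hugging Face datasets library; the snippet below is a minimal sketch and assumes the dataset exposes a "train" split.

```python
from datasets import load_dataset

# Fetch the 10K step-wise preference pairs from the Hugging Face Hub.
ds = load_dataset("xinlai/Math-Step-DPO-10K", split="train")
print(len(ds), ds.column_names)
```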

Highlighted Details

  • Achieves state-of-the-art results, surpassing closed-source models like GPT-4-1106 on MATH and GSM8K benchmarks.
  • Demonstrates significant gains on Qwen2-7B-Instruct (a 5.6-point boost on MATH and 2.4 points on GSM8K) with only 10K preference pairs.
  • Provides a data construction pipeline to generate custom step-wise preference datasets (a hypothetical record shape is sketched after this list).
  • Offers pre-trained models fine-tuned with Step-DPO, available on Hugging Face.
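
To make the step-wise preference format concrete, one record produced by such a pipeline might look like the following; the field names and values here are illustrative assumptions, not the dataset's documented schema.

```python
# Hypothetical step-wise preference record (field names are assumptions).
example = {
    "prompt": "What is 12 * 15 - 30?",
    "initial_reason_steps": "Step 1: Compute 12 * 15 = 180.\n",
    "chosen": "Step 2: Subtract 30: 180 - 30 = 150.",    # correct next step
    "rejected": "Step 2: Subtract 30: 180 - 30 = 140.",  # erroneous next step
}
```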

Maintenance & Community

Updates have included the release of data construction scripts and a model demo, though activity has since slowed (see Health Check below). The codebase builds on established projects such as alignment-handbook.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not detail specific limitations, but training larger models demands substantial GPU resources, and the data construction pipeline relies on GPT-4o, which may incur API costs.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days
