MCTS-DPO  by YuxiXie

Research paper code for MCTS-boosted reasoning via DPO

created 1 year ago
319 stars

Top 86.1% on sourcepulse

Project Summary

This repository provides the source code for MCTS-DPO, a method that enhances reasoning capabilities in language models through iterative preference learning guided by Monte Carlo Tree Search (MCTS). It is designed for researchers and practitioners working on improving the reasoning abilities of large language models, particularly in complex problem-solving domains.

How It Works

MCTS-DPO uses Monte Carlo Tree Search to collect preference data for Direct Preference Optimization (DPO). MCTS explores candidate reasoning steps, and its look-ahead breaks instance-level rewards down into step-level signals. The policy is then updated with DPO on the resulting step-level preferences, and the cycle repeats with the updated model. The project presents this as a more robust and efficient learning process than standard DPO, especially on tasks requiring multi-step reasoning.
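
As a rough illustration of the DPO side of this loop, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The function name `dpo_loss`, the `beta` default, and the scalar inputs are assumptions chosen for illustration; they are not taken from the repository, which may batch, mask, or weight the loss differently.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of log-probabilities.

    Hypothetical helper for illustration; not the repository's API.
    """
    # Log-ratio of policy to reference model for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the two implicit rewards.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the chosen response relative to the reference.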

Quick Start & Requirements

  • Install:
    conda env create --file conda-recipe.yaml
    pip install -r requirements.txt
    
  • Datasets: Requires downloading datasets such as Arithmo, GSM8K, MATH, ARC, AI2S, OBQA, and SciQ.
  • Run:
    bash scripts/mcts_mathqa.sh
    bash scripts/mcts_csr.sh
    
  • Prerequisites: A Conda-managed Python environment; dependencies are listed in conda-recipe.yaml and requirements.txt.

Highlighted Details

  • Implements "Self-Evaluation Guided MCTS for online DPO."
  • Code adapted from the Safe-RLHF repository.
  • Supports reasoning tasks on datasets like GSM8K and MATH.
  • Tested with Mistral (SFT) models.
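
To make the idea of MCTS-guided preference collection more concrete, here is a minimal sketch of two pieces such a loop typically needs: choosing which candidate step to expand next, and turning the search statistics at a node into a (preferred, dispreferred) pair for DPO. Everything here is an assumption for illustration: the `Node` class, the plain UCT formula, the exploration constant `c`, and the rule of pairing the highest-value child with the lowest-value child are hypothetical and may differ from what the repository does.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate reasoning step in the search tree (hypothetical)."""
    step: str
    visits: int = 0
    value_sum: float = 0.0
    children: list = field(default_factory=list)

    @property
    def q(self):
        # Mean value of this step over all simulations that passed through it.
        return self.value_sum / self.visits if self.visits else 0.0


def uct_select(parent, c=1.0):
    """Pick the child with the highest UCT score; unvisited children go first."""
    def score(child):
        if child.visits == 0:
            return math.inf
        explore = math.sqrt(math.log(max(parent.visits, 1)) / child.visits)
        return child.q + c * explore
    return max(parent.children, key=score)


def preference_pair(parent):
    """Return (chosen_step, rejected_step) from visited children, or None."""
    visited = [child for child in parent.children if child.visits > 0]
    if len(visited) < 2:
        return None
    best = max(visited, key=lambda child: child.q)
    worst = min(visited, key=lambda child: child.q)
    if best.q == worst.q:
        return None  # no preference signal when the values tie
    return best.step, worst.step


root = Node("question", visits=10, children=[
    Node("step A", visits=6, value_sum=4.8),
    Node("step B", visits=4, value_sum=1.0),
    Node("step C"),
])
```

In this toy tree, `uct_select(root)` returns the unvisited "step C", while `preference_pair(root)` pairs "step A" (chosen) with "step B" (rejected) because only visited children carry a value estimate.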

Maintenance & Community

The project is associated with authors Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. Further community or maintenance details are not specified in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking is therefore undetermined.

Limitations & Caveats

The README does not specify any limitations, known bugs, or deprecation status. The project appears to be research-oriented, and its production-readiness or long-term support is not detailed.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 14 stars in the last 90 days
