Research paper code for MCTS-boosted reasoning via DPO
This repository provides the source code for MCTS-DPO, a method that enhances reasoning capabilities in language models through iterative preference learning guided by Monte Carlo Tree Search (MCTS). It is designed for researchers and practitioners working on improving the reasoning abilities of large language models, particularly in complex problem-solving domains.
How It Works
MCTS-DPO integrates Monte Carlo Tree Search into the Direct Preference Optimization (DPO) framework. MCTS explores candidate reasoning steps, and the resulting search statistics are used to construct preference pairs that iteratively refine the model's policy via DPO. This search-guided signal makes learning more robust and efficient than standard DPO, especially on tasks that require multi-step reasoning.
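As a rough illustration of that loop, the sketch below turns MCTS value estimates over sibling reasoning steps into (chosen, rejected) pairs and feeds them to a standard DPO loss. It is a minimal sketch under assumed names (`collect_preference_pairs`, `dpo_loss`, the toy tree structure) and is not the repository's actual API.

```python
# Minimal sketch of MCTS-guided iterative preference learning with DPO.
# All names and data structures here are illustrative, not the repo's real code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on (chosen, rejected) log-probabilities."""
    policy_ratio = policy_chosen_logps - policy_rejected_logps
    ref_ratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_ratio - ref_ratio)).mean()

def collect_preference_pairs(tree):
    """Turn MCTS statistics over reasoning steps into step-level preference pairs.

    For each expanded node, the child with the highest value estimate becomes
    the 'chosen' continuation and the lowest-value sibling the 'rejected' one,
    which is the intuition behind using search statistics as preference labels.
    """
    pairs = []
    for node in tree:  # each node holds candidate next reasoning steps
        ranked = sorted(node["children"], key=lambda c: c["value"], reverse=True)
        if len(ranked) >= 2:
            pairs.append((ranked[0]["step"], ranked[-1]["step"]))
    return pairs

# Toy example: one node with three candidate reasoning steps scored by MCTS.
tree = [{"children": [
    {"step": "Add 3 to both sides.", "value": 0.9},
    {"step": "Multiply both sides by 0.", "value": 0.1},
    {"step": "Guess x = 7.", "value": 0.4},
]}]
pairs = collect_preference_pairs(tree)

# Stand-in log-probabilities for the chosen/rejected steps under the policy and
# reference models; in practice these would come from the language model.
loss = dpo_loss(torch.tensor([-2.0]), torch.tensor([-3.5]),
                torch.tensor([-2.2]), torch.tensor([-3.0]))
print(pairs, loss.item())
```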
Quick Start & Requirements
# Set up the environment
conda env create --file conda-recipe.yaml
pip install -r requirements.txt

# Launch the MathQA and CSR experiment scripts
bash scripts/mcts_mathqa.sh
bash scripts/mcts_csr.sh
Maintenance & Community
The project is authored by Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. Further community or maintenance details are not specified in the README; the repository's last recorded activity was about a year ago and it appears inactive.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking is therefore undetermined.
Limitations & Caveats
The README does not specify any limitations, known bugs, or deprecation status. The project appears to be research-oriented, and its production-readiness or long-term support is not detailed.