Research paper code for MCTS-boosted reasoning via DPO
This repository provides the source code for MCTS-DPO, a method that enhances reasoning capabilities in language models through iterative preference learning guided by Monte Carlo Tree Search (MCTS). It is designed for researchers and practitioners working on improving the reasoning abilities of large language models, particularly in complex problem-solving domains.
How It Works
MCTS-DPO integrates Monte Carlo Tree Search into the Direct Preference Optimization (DPO) framework. MCTS explores candidate reasoning steps, and the resulting search statistics are used to construct preference pairs that iteratively refine the model's policy via DPO. This search-guided signal makes learning more robust and efficient than standard DPO, especially on tasks that require multi-step reasoning.
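As a rough illustration of that loop, the sketch below turns MCTS value estimates over sibling reasoning steps into (chosen, rejected) pairs and feeds them to a standard DPO loss. It is a minimal sketch under assumed names (`collect_preference_pairs`, `dpo_loss`, the toy tree structure) and is not the repository's actual API.

```python
# Minimal sketch of MCTS-guided iterative preference learning with DPO.
# All names and data structures here are illustrative, not the repo's real code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on (chosen, rejected) log-probabilities."""
    policy_ratio = policy_chosen_logps - policy_rejected_logps
    ref_ratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_ratio - ref_ratio)).mean()

def collect_preference_pairs(tree):
    """Turn MCTS statistics over reasoning steps into step-level preference pairs.

    For each expanded node, the child with the highest value estimate becomes
    the 'chosen' continuation and the lowest-value sibling the 'rejected' one,
    which is the intuition behind using search statistics as preference labels.
    """
    pairs = []
    for node in tree:  # each node holds candidate next reasoning steps
        ranked = sorted(node["children"], key=lambda c: c["value"], reverse=True)
        if len(ranked) >= 2:
            pairs.append((ranked[0]["step"], ranked[-1]["step"]))
    return pairs

# Toy example: one node with three candidate reasoning steps scored by MCTS.
tree = [{"children": [
    {"step": "Add 3 to both sides.", "value": 0.9},
    {"step": "Multiply both sides by 0.", "value": 0.1},
    {"step": "Guess x = 7.", "value": 0.4},
]}]
pairs = collect_preference_pairs(tree)

# Stand-in log-probabilities for the chosen/rejected steps under the policy and
# reference models; in practice these would come from the language model.
loss = dpo_loss(torch.tensor([-2.0]), torch.tensor([-3.5]),
                torch.tensor([-2.2]), torch.tensor([-3.0]))
print(pairs, loss.item())
```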
Quick Start & Requirements
# Set up the environment
conda env create --file conda-recipe.yaml
pip install -r requirements.txt

# Launch the MathQA and CSR experiment scripts
bash scripts/mcts_mathqa.sh
bash scripts/mcts_csr.sh
Maintenance & Community
The project is authored by Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. Further community or maintenance details are not specified in the README; the repository's last recorded activity was about a year ago and it appears inactive.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking is therefore undetermined.
Limitations & Caveats
The README does not specify any limitations, known bugs, or deprecation status. The project appears to be research-oriented, and its production-readiness or long-term support is not detailed.