ReST-MCTS by THUDM

Research paper on LLM self-training via tree search

created 1 year ago
654 stars

Top 52.0% on sourcepulse

Project Summary

This repository implements ReST-MCTS*, a novel self-training framework for Large Language Models (LLMs) that leverages Monte Carlo Tree Search (MCTS) guided by inferred process rewards. It targets researchers and developers aiming to improve LLM reasoning capabilities by generating higher-quality training data without extensive manual annotation.

How It Works

ReST-MCTS* integrates process reward guidance with MCTS to infer per-step rewards that guide the generation of reasoning traces. Given only oracle final answers, it estimates how likely each intermediate step is to lead to the correct solution, without requiring manual per-step annotation. These inferred per-step values serve both as training targets for refining the process reward (value) model and as a selection criterion for the high-quality traces used to train the policy model, closing a self-training loop that automatically collects valuable training data.
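To make the reward-inference idea concrete, the sketch below shows one simple way a per-step value could be estimated from rollouts scored against an oracle final answer. This is not the repository's actual implementation; the function name, data layout, and counting-based estimate are illustrative assumptions.

```python
# Hedged sketch: estimating per-step process rewards from final-answer correctness.
# NOT the repository's code; names and the counting-based estimate are assumptions
# illustrating the general idea described above.

from collections import defaultdict

def infer_step_values(reasoning_traces, oracle_answer):
    """Estimate how often each partial reasoning prefix leads to a correct answer.

    reasoning_traces: list of (steps, final_answer) pairs produced by tree search,
        where `steps` is a list of intermediate reasoning strings.
    oracle_answer: the known correct final answer used to score rollouts.
    Returns a dict mapping each step prefix (as a tuple) to an empirical value in [0, 1].
    """
    reach_correct = defaultdict(int)   # prefix -> rollouts through it that ended correctly
    reach_total = defaultdict(int)     # prefix -> rollouts passing through it

    for steps, final_answer in reasoning_traces:
        correct = (final_answer == oracle_answer)
        for k in range(1, len(steps) + 1):
            prefix = tuple(steps[:k])
            reach_total[prefix] += 1
            reach_correct[prefix] += int(correct)

    # Empirical probability that continuing from this prefix yields the right answer.
    return {p: reach_correct[p] / reach_total[p] for p in reach_total}
```

In the actual framework these values are not tabulated: they guide node selection during tree search and serve as regression targets for the learned process reward model.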

Quick Start & Requirements

  • Installation: Requires separate Conda environments for different model backbones. Install dependencies using pip install -r requirements_mistral.txt (for Mistral/Llama) or pip install -r requirements_sciglm.txt (for SciGLM).
  • Python Versions: Python 3.11 for GLM models, Python 3.12 for Mistral/Llama models.
  • Dependencies: Specific versions of the transformers library may be needed for certain Hugging Face models.
  • Models: Requires local paths to policy and value model checkpoints. Pre-trained policy and reward model data are available on Hugging Face.
  • Data: A JSON file format is required for input questions, with an optional "answer" field (a hedged example is sketched after this list).
  • Running MCTS: Use MCTS/task.py for single questions (e.g., python MCTS/task.py) or evaluate.py for benchmark evaluation.
  • Links: ReST-MCTS* Paper, GitHub, Website
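The README excerpt only confirms that inputs are JSON with an optional "answer" field; the snippet below is a minimal, assumed example of preparing such a file, with the "question" key, list layout, and file path being illustrative guesses.

```python
# Hedged sketch: writing an input questions file in the JSON format described above.
# Only the optional "answer" field is confirmed by the README excerpt; the
# "question" key, list layout, and output path are assumptions for illustration.

import json

questions = [
    {"question": "What is the derivative of x^2?", "answer": "2x"},
    {"question": "Name the process by which plants convert light into energy."},  # no oracle answer
]

with open("my_questions.json", "w", encoding="utf-8") as f:
    json.dump(questions, f, ensure_ascii=False, indent=2)
```

With such a file in place, a single question can be run via `python MCTS/task.py` and benchmark-style runs via `evaluate.py`, with local policy and value model checkpoint paths supplied as described above.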

Highlighted Details

  • Supports multiple LLM backbones including Llama3-8B-Instruct, Mistral-7B (MetaMATH), and SciGLM-6B.
  • Provides pre-trained policy and reward model data for various configurations on Hugging Face.
  • Includes baseline implementations for self-rewarding (DPO) and ReST-EM (CoT).
  • Offers evaluation scripts and code for plotting self-training results.

Maintenance & Community

The project is associated with THUDM (Tsinghua University). Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README excerpt. Compatibility for commercial use or closed-source linking would require clarification of the license.

Limitations & Caveats

The setup requires careful management of Python and transformers library versions across different model backbones. Specific instructions for training custom value models are provided, but users may need to adapt code for unsupported models.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 42 stars in the last 90 days
