ReST-MCTS by THUDM

Research paper on LLM self-training via tree search

Created 1 year ago
665 stars

Top 50.6% on SourcePulse

View on GitHub
Project Summary

This repository implements ReST-MCTS*, a novel self-training framework for Large Language Models (LLMs) that leverages Monte Carlo Tree Search (MCTS) guided by inferred process rewards. It targets researchers and developers aiming to improve LLM reasoning capabilities by generating higher-quality training data without extensive manual annotation.

How It Works

ReST-MCTS* couples reinforced self-training with Monte Carlo Tree Search, using inferred per-step process rewards to guide the generation of reasoning traces. Given only the oracle final answer, it estimates the probability that each intermediate step leads to the correct solution. These inferred rewards serve both as value targets for refining the reward (value) model and as a selection criterion for the high-quality traces used to train the policy model, automating the collection of valuable training data for self-improvement.
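As a rough illustration of the reward-inference idea, the sketch below estimates a value for each reasoning prefix via Monte Carlo rollouts checked against the oracle answer. This is not the repository's code: the helpers `rollout_from` and `is_correct` are hypothetical placeholders, and the actual implementation derives these quantities inside the MCTS* search rather than from independent rollouts.

```python
from typing import Callable, List

def infer_step_values(
    question: str,
    steps: List[str],                               # one reasoning step per entry
    rollout_from: Callable[[str, List[str]], str],  # hypothetical: complete a solution from a prefix
    is_correct: Callable[[str], bool],              # hypothetical: compare a final answer to the oracle
    n_rollouts: int = 8,
) -> List[float]:
    """Estimate, for each step prefix, the probability of reaching the correct answer.

    These estimates stand in for the inferred process rewards: they become value
    targets for the reward (value) model and a filter for high-quality traces.
    """
    values: List[float] = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        hits = sum(is_correct(rollout_from(question, prefix)) for _ in range(n_rollouts))
        values.append(hits / n_rollouts)
    return values
```

Traces whose steps score highly would be retained as self-training data for the policy model, while the resulting (prefix, value) pairs supervise the process reward model.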

Quick Start & Requirements

  • Installation: Requires separate Conda environments for different model backbones. Install dependencies using pip install -r requirements_mistral.txt (for Mistral/Llama) or pip install -r requirements_sciglm.txt (for SciGLM).
  • Python Versions: Python 3.11 for GLM models, Python 3.12 for Mistral/Llama models.
  • Dependencies: Specific versions of the transformers library may be needed for certain Hugging Face models.
  • Models: Requires local paths to policy and value model checkpoints. Pre-trained policy and reward model data are available on Hugging Face.
  • Data: Input questions are provided as a JSON file, with an optional "answer" field (an illustrative example follows this list).
  • Running MCTS: Use MCTS/task.py for single questions (e.g., python MCTS/task.py) or evaluate.py for benchmark evaluation.
  • Links: ReST-MCTS* Paper, GitHub, Website
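The summary does not spell out the exact input schema or command-line flags, so the snippet below is only an assumed example: the "content" field name, the output path, and the inline commands are illustrative and should be checked against the repository's README.

```python
import json

# Hypothetical question file: the summary only states that input is a JSON file of
# questions with an optional "answer" field; the real field names may differ.
questions = [
    {
        "content": "If 3x + 5 = 20, what is x?",  # question text (field name assumed)
        "answer": "5",                            # optional oracle final answer
    },
]

with open("example_questions.json", "w", encoding="utf-8") as f:
    json.dump(questions, f, ensure_ascii=False, indent=2)

# Search and evaluation are then launched from the repository root, e.g.:
#   python MCTS/task.py   # single-question demo
#   python evaluate.py    # benchmark evaluation
# Model checkpoint paths and the data file are configured as described in the README.
```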

Highlighted Details

  • Supports multiple LLM backbones including Llama3-8B-Instruct, Mistral-7B (MetaMATH), and SciGLM-6B.
  • Provides pre-trained policy and reward model data for various configurations on Hugging Face.
  • Includes baseline implementations for self-rewarding (DPO) and ReST-EM (CoT).
  • Offers evaluation scripts and code for plotting self-training results.

Maintenance & Community

The project is associated with THUDM (Tsinghua University). Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README excerpt. Compatibility for commercial use or closed-source linking would require clarification of the license.

Limitations & Caveats

The setup requires careful management of Python and transformers library versions across different model backbones. Specific instructions for training custom value models are provided, but users may need to adapt code for unsupported models.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 30 days

Explore Similar Projects

  • awesome-o1 by srush: Bibliography for OpenAI's o1 project. 1k stars. Created 11 months ago; updated 10 months ago. Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Tim J. Baek (Founder of Open WebUI), and 6 more.
  • self-rewarding-lm-pytorch by lucidrains: Training framework for self-rewarding language models. 1k stars. Created 1 year ago; updated 1 year ago. Starred by Vincent Weisser (Cofounder of Prime Intellect), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.
  • Eureka by eureka-research: LLM-based reward design for reinforcement learning. 3k stars. Created 2 years ago; updated 1 year ago. Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Edward Sun (Research Scientist at Meta Superintelligence Lab).