Research code for long chain-of-thought reasoning in LLMs
This repository provides code and experimental setups for investigating how Large Language Models (LLMs) learn and generate long Chain-of-Thought (CoT) reasoning. It targets researchers and practitioners aiming to improve LLM reasoning capabilities, particularly in complex domains like mathematics, by enabling longer, more structured reasoning processes.
How It Works
The project is a fork of OpenRLHF, modified to support rule-based reward functions (e.g., a Cosine Reward for length control) and multiple reward types with separate discount factors for PPO and REINFORCE++. It also integrates an "LLM-as-a-judge" component for reference-guided answer verification and uses MinHash to identify long-CoT reasoning patterns in pre-training data. Together, these pieces aim to systematically analyze and replicate the long CoT generation observed in advanced reasoning models.
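As a concrete illustration, a length-aware cosine reward can be expressed as a pure function of answer correctness and generation length. The sketch below is a minimal rendering of that idea; the function name, interpolation endpoints, and penalty constants are illustrative assumptions, not the repository's actual defaults.

```python
import math

def cosine_length_reward(correct: bool, gen_len: int, max_len: int,
                         r_correct=(2.0, 1.0),   # reward at len 0 -> max_len when correct
                         r_wrong=(-10.0, 0.0),   # reward at len 0 -> max_len when wrong
                         r_exceed=-10.0) -> float:
    """Cosine-interpolated reward over generation length (illustrative values).

    Correct answers earn more when shorter; wrong answers are penalized
    less when longer, nudging the model to keep reasoning rather than
    commit early to a short wrong answer.
    """
    if gen_len >= max_len:
        return r_exceed  # flat penalty for truncated / over-length generations
    r_start, r_end = r_correct if correct else r_wrong
    # cos goes from 1 to -1 as gen_len goes from 0 to max_len, so the
    # reward moves smoothly from r_start down (or up) to r_end.
    return r_end + 0.5 * (r_start - r_end) * (1.0 + math.cos(math.pi * gen_len / max_len))
```

In a PPO or REINFORCE++ loop, a scalar like this would be computed per rollout and combined with, or substituted for, a learned reward model's score.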
Quick Start & Requirements
First remove conflicting packages, then install from PyPI: `sudo pip uninstall xgboost transformer_engine flash_attn -y`, followed by `pip install openrlhf`. For vLLM acceleration, use `pip install openrlhf[vllm]` or `pip install openrlhf[vllm_latest]`. Alternatively, clone the repository and run `pip install -e .`
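A quick post-install sanity check can confirm the packages resolved. This is a minimal sketch; it assumes only the package names from the commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# vllm is only present if one of the [vllm] extras above was installed.
for pkg in ("openrlhf", "vllm"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```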
Highlighted Details
Maintenance & Community
The project builds on OpenRLHF and acknowledges contributions from several other LLM projects. Open TODOs for action-prompting code and additional run scripts indicate ongoing development.
Licensing & Compatibility
The repository is a fork of OpenRLHF, which is Apache 2.0 licensed. The README does not explicitly state a license for the modifications, but the code builds on Apache 2.0 licensed projects. Compatibility for commercial use is likely inherited from the base project, but verification is recommended.
Limitations & Caveats
The README notes that run scripts require minor fixes for file paths and API keys. Some dependencies may vary based on the environment. The project is presented alongside a research paper, suggesting it's primarily for experimental reproduction and further research rather than a production-ready library.