the-markovian-thinker by McGill-NLP

LLM reasoning with bounded state

Created 3 weeks ago

300 stars

Top 88.6% on SourcePulse

Project Summary

The Markovian Thinker project introduces a paradigm for efficient reasoning in Large Language Models (LLMs) trained with Reinforcement Learning (RL). It addresses the quadratic compute cost of standard long chain-of-thought (LongCoT) RL by keeping the reasoning state at a fixed size, so compute scales linearly with reasoning length. This benefits researchers and practitioners seeking to improve LLM reasoning capabilities without prohibitive computational costs.

How It Works

The core innovation is the "Markovian Thinking" paradigm, which reformulates the RL environment to maintain a bounded, fixed-size state. This is implemented via the "Delethink" approach, which processes generation in fixed-size chunks. At each chunk boundary, the context is reset to the original prompt plus a concise carryover, compelling the model to learn to make progress from that state alone. This contrasts with sequential token concatenation (e.g., LongCoT), where the state grows linearly with the number of generated tokens and attention compute therefore grows quadratically. Delethink achieves linear compute complexity and flat memory usage.
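The chunk-and-carryover loop described above can be sketched in a few lines. This is a toy illustration, not the project's implementation: `toy_generate` stands in for an LLM, and the chunk/carryover sizes are made-up values.

```python
# Hedged sketch of Delethink-style chunked generation (all names and sizes
# are hypothetical). Instead of growing the context with every generated
# token (LongCoT), the context is reset at each chunk boundary to the
# original prompt plus a short carryover, keeping the state size bounded.

CHUNK_SIZE = 8   # tokens generated per chunk (toy value)
CARRYOVER = 4    # tokens carried across each boundary (toy value)

def toy_generate(context, n_tokens):
    """Stand-in for an LLM: emits n_tokens dummy tokens."""
    return [f"t{len(context) + i}" for i in range(n_tokens)]

def markovian_think(prompt_tokens, total_tokens):
    """Generate total_tokens of reasoning while the context stays bounded."""
    produced = []
    carry = []
    max_state = 0
    while len(produced) < total_tokens:
        # Bounded state: original prompt + concise carryover only.
        context = prompt_tokens + carry
        max_state = max(max_state, len(context))
        chunk = toy_generate(context, min(CHUNK_SIZE, total_tokens - len(produced)))
        produced.extend(chunk)
        carry = chunk[-CARRYOVER:]  # context is reset at the chunk boundary
    return produced, max_state

tokens, peak = markovian_think(["p0", "p1"], 32)
```

Regardless of how long the reasoning trace grows, the peak context here never exceeds the prompt length plus the carryover size, which is the property that yields flat memory and linear compute.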

Quick Start & Requirements

  • Installation relies on the verl and SGLang frameworks, with options for a pre-built Docker image or uv-based installation (details in INSTALLATION.md).
  • Prerequisites include significant GPU resources (demonstrated with multi-GPU setups and H100s). Specific models like deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B are used for training and demos.
  • Trajectory visualization requires textual==0.52.1.
  • Links: INSTALLATION.md, verl.readthedocs.io (for Ray debugger).

Highlighted Details

  • Delethink (24K context) achieves comparable or superior accuracy to LongCoT-RL (24K context) with reduced compute.
  • The method demonstrates continued performance improvement beyond its trained context budget, unlike methods that plateau.
  • Training exhibits linear compute scaling with reasoning length, contrasting with LongCoT's quadratic scaling.
  • Large models like GPT-OSS-120B and Qwen3-30B-A3B show zero-shot Markovian Thinking capabilities.
  • The framework supports scaling reasoning to 96K tokens.
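The linear-versus-quadratic scaling contrast in the bullets above can be made concrete with a back-of-envelope count of attended token pairs. This is a toy model with illustrative chunk and state sizes, not the paper's actual settings:

```python
# Back-of-envelope attention-cost comparison (counts attended token pairs
# only; chunk/state sizes below are illustrative, not the paper's values).

def longcot_pairs(n_tokens):
    """LongCoT: full causal attention over a context that never resets.
    Token i attends to all i previous tokens -> ~n^2/2 pairs total."""
    return n_tokens * (n_tokens - 1) // 2

def delethink_pairs(n_tokens, chunk=8, carry=4, prompt=2):
    """Delethink-style: each chunk attends only to a bounded state
    (prompt + carryover + tokens emitted so far in the chunk), so the
    per-chunk cost is constant and total cost grows linearly."""
    state = prompt + carry
    per_chunk = sum(state + i for i in range(chunk))
    return (n_tokens // chunk) * per_chunk
```

Doubling `n_tokens` doubles the Delethink count but roughly quadruples the LongCoT count, which is why the gap widens as reasoning budgets grow toward 96K tokens.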

Maintenance & Community

  • Authored by researchers from McGill University, Mila, and Microsoft.
  • The codebase is built upon verl and SGLang.
  • The project was released in October 2025, with paper, models, and codebase available.
  • No explicit community channels (e.g., Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility

  • The README does not specify a software license. This omission requires clarification for adoption decisions, especially regarding commercial use or derivative works.

Limitations & Caveats

  • The evaluation section is marked as "TBD," suggesting that comprehensive performance benchmarks or results may still be pending.
  • As a very recent release (October 2025), long-term maintenance and community support are yet to be established.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 302 stars in the last 27 days

Explore Similar Projects

Starred by Edward Sun (Research Scientist at Meta Superintelligence Lab), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 4 more.

batch_invariant_ops by thinking-machines-lab

  • Enhance LLM inference determinism
  • 875 stars (top 1.0%)
  • Created 1 month ago; updated 19 hours ago