SimpleTIR by ltzheng

LLMs for multi-turn tool-integrated reasoning with RL

Created 2 months ago
280 stars

Top 93.0% on SourcePulse

View on GitHub
Project Summary

Summary:

SimpleTIR addresses the challenge of stable, multi-turn Tool-Integrated Reasoning (TIR) for Large Language Models (LLMs) using Reinforcement Learning (RL). It targets researchers and developers seeking to enhance LLM capabilities in complex problem-solving, data analysis, and multi-step reasoning. The project offers a novel RL stabilization technique, enabling diverse reasoning patterns and improved performance over supervised methods.

How It Works:

SimpleTIR employs end-to-end RL to train LLMs for iterative code generation, execution, and result analysis in multi-turn scenarios. It tackles training instability, which stems from distributional drift in tool outputs and compounding errors, by filtering out trajectories that contain "void" turns (turns yielding neither a code block nor a final answer). This stabilizes training and fosters diverse reasoning patterns such as self-correction and inductive reasoning, surpassing the limitations of Supervised Fine-Tuning (SFT).

Quick Start & Requirements:

  • Primary Install/Run: Execute bash train.sh with specified arguments for training or evaluation.
  • Prerequisites: Requires multiple H100 nodes for efficient training/evaluation. Tested with vllm==0.8.5. Recommends Ray for multi-node task submission and a sandbox (internal or firejail) for code execution. Base model checkpoints (e.g., Qwen2.5-7B) and datasets are necessary.
  • Resources: Training and evaluation demand significant GPU resources (multiple H100 nodes).
  • Links: Paper: arxiv.org/abs/2509.02479, Notion: simpletir.notion.site/report, Hugging Face: huggingface.co/collections/ZhenghaiXue/simpletir-686ce09ae6e1db33b375f03d.

Highlighted Details:

  • Stabilizes multi-turn TIR training via "void" turn filtering.
  • Reports improved performance over supervised fine-tuning (SFT) approaches.
  • Enables diverse reasoning patterns (inductive, self-correction, cross-validation, progressive) through end-to-end RL.
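The void-turn filter highlighted above can be sketched in a few lines. This is a paraphrase of the criterion as described in the summary, not the repository's implementation: the `CODE:`/`FINAL:` turn markers and function names are assumptions, and the paper's exact check may differ.

```python
def is_void(turn: str) -> bool:
    """A turn is 'void' when it yields neither executable code nor a final
    answer (criterion paraphrased from the project description)."""
    return not (turn.startswith("CODE:") or turn.startswith("FINAL:"))

def filter_void_trajectories(batch):
    """Drop any trajectory containing a void turn before the RL update, so
    gradients never flow through degenerate multi-turn rollouts."""
    return [traj for traj in batch if not any(is_void(t) for t in traj)]

# A batch of two rollouts; the second contains a void first turn.
batch = [
    ["CODE: print(1 + 1)", "FINAL: 2"],
    ["Hmm, let me think...", "FINAL: 2"],
]
```

Here `filter_void_trajectories(batch)` keeps only the first rollout; discarding whole trajectories (rather than masking single turns) is what prevents the compounding-error feedback loop described above.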

Maintenance & Community:

  • Contributors: Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Zejun Ma, Bo An.
  • Acknowledgements: Code contributions acknowledged from verl and Search-R1.
  • Community/Roadmap: No explicit community channels (Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility:

  • License Type: The repository's license is not specified in the provided README.
  • Compatibility: No specific compatibility notes for commercial or closed-source use are mentioned.

Limitations & Caveats:

Training instability in multi-turn RL is the core challenge the project addresses, and the work is in an active research and development phase. High hardware requirements (multiple H100 GPUs) present a significant adoption barrier.

Health Check
Last Commit

3 days ago

Responsiveness

Inactive

Pull Requests (30d)
3
Issues (30d)
9
Star History
110 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Sebastian Raschka (author of "Build a Large Language Model (From Scratch)"), and 14 more.

verifiers by willccbb

3.1%
3k
RL for LLMs in verifiable environments
Created 7 months ago
Updated 22 hours ago