SimpleTIR by ltzheng

LLMs for multi-turn tool-integrated reasoning with RL

Created 2 months ago
280 stars

Top 93.0% on SourcePulse

View on GitHub
Project Summary

Summary:

SimpleTIR addresses the challenge of stable, multi-turn Tool-Integrated Reasoning (TIR) for Large Language Models (LLMs) using Reinforcement Learning (RL). It targets researchers and developers seeking to enhance LLM capabilities in complex problem-solving, data analysis, and multi-step reasoning. The project offers a novel RL stabilization technique, enabling diverse reasoning patterns and improved performance over supervised methods.

How It Works:

SimpleTIR employs end-to-end RL to train LLMs for iterative code generation, execution, and result analysis in multi-turn scenarios. It tackles training instability, which stems from distributional drift in tool outputs and compounding errors, by filtering out trajectories that contain "void" turns (turns yielding neither a code block nor a final answer). This stabilizes training and fosters diverse reasoning patterns such as self-correction and inductive reasoning, surpassing the limitations of Supervised Fine-Tuning (SFT).

Quick Start & Requirements:

  • Primary Install/Run: Execute bash train.sh with specified arguments for training or evaluation.
  • Prerequisites: Requires multiple H100 nodes for efficient training/evaluation. Tested with vllm==0.8.5. Recommends Ray for multi-node task submission and a sandbox (internal or firejail) for code execution. Base model checkpoints (e.g., Qwen2.5-7B) and datasets are necessary.
  • Resources: Training and evaluation demand significant GPU resources (multiple H100 nodes).
  • Links: Paper: arxiv.org/abs/2509.02479, Notion: simpletir.notion.site/report, Hugging Face: huggingface.co/collections/ZhenghaiXue/simpletir-686ce09ae6e1db33b375f03d.

Highlighted Details:

  • Stabilizes multi-turn TIR training via "void" turn filtering.
  • Reports improved performance over supervised fine-tuning (SFT) approaches.
  • Enables diverse reasoning patterns (inductive, self-correction, cross-validation, progressive) through end-to-end RL.
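The void-turn filter highlighted above can be sketched in a few lines. This is a paraphrase of the criterion as described in the summary, not the repository's implementation: the `CODE:`/`FINAL:` turn markers and function names are assumptions, and the paper's exact check may differ.

```python
def is_void(turn: str) -> bool:
    """A turn is 'void' when it yields neither executable code nor a final
    answer (criterion paraphrased from the project description)."""
    return not (turn.startswith("CODE:") or turn.startswith("FINAL:"))

def filter_void_trajectories(batch):
    """Drop any trajectory containing a void turn before the RL update, so
    gradients never flow through degenerate multi-turn rollouts."""
    return [traj for traj in batch if not any(is_void(t) for t in traj)]

# A batch of two rollouts; the second contains a void first turn.
batch = [
    ["CODE: print(1 + 1)", "FINAL: 2"],
    ["Hmm, let me think...", "FINAL: 2"],
]
```

Here `filter_void_trajectories(batch)` keeps only the first rollout; discarding whole trajectories (rather than masking single turns) is what prevents the compounding-error feedback loop described above.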

Maintenance & Community:

  • Contributors: Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Zejun Ma, Bo An.
  • Acknowledgements: Code contributions acknowledged from verl and Search-R1.
  • Community/Roadmap: No explicit community channels (Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility:

  • License Type: The repository's license is not specified in the provided README.
  • Compatibility: No specific compatibility notes for commercial or closed-source use are mentioned.

Limitations & Caveats:

Training instability in multi-turn RL is the core challenge the project addresses, and the work is in an active research and development phase. High hardware requirements (multiple H100 GPUs) present a significant adoption barrier.

Health Check
Last Commit

3 days ago

Responsiveness

Inactive

Pull Requests (30d)
3
Issues (30d)
9
Star History
110 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Sebastian Raschka (author of "Build a Large Language Model (From Scratch)"), and 14 more.

verifiers by willccbb

3.1%
3k
RL for LLMs in verifiable environments
Created 7 months ago
Updated 22 hours ago