ToRL by GAIR-NLP

Tool-integrated RL for autonomous tool discovery and refinement

Created 5 months ago
288 stars

Top 91.2% on SourcePulse

View on GitHub
Project Summary

ToRL (Tool-Integrated Reinforcement Learning) is a framework for enabling large language models to autonomously discover and refine tool usage strategies through reinforcement learning, targeting researchers and developers working on complex reasoning tasks. It aims to achieve state-of-the-art performance by allowing models to learn when and how to invoke tools, leading to emergent cognitive behaviors like self-correction and adaptive strategy selection.

How It Works

ToRL challenges traditional supervised fine-tuning approaches by employing exploration-driven reinforcement learning for tool integration. Models learn to invoke tools, cross-validate outputs with reasoning, and self-correct errors without explicit human supervision or predefined tool patterns. This approach allows models to adaptively select between tool-based and pure-reasoning strategies, enhancing performance on challenging mathematical benchmarks.
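To make the interleaving concrete, here is a minimal, hypothetical sketch of a tool-integrated rollout loop, assuming the model marks tool calls with fenced python blocks and receives results in output blocks (a common convention); the generate and execute_code callables are placeholders, not part of the ToRL codebase:

    # Hypothetical sketch of a tool-integrated rollout: the model may emit a
    # fenced python block, the block is executed externally, and the execution
    # output is appended to the context so the model can verify the result,
    # self-correct, or continue with pure reasoning.
    import re

    CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

    def tool_integrated_rollout(prompt, generate, execute_code, max_tool_calls=4):
        # generate(context) -> str: LLM continuation (hypothetical interface)
        # execute_code(code) -> str: runs code in a sandbox, returns its output
        context = prompt
        for _ in range(max_tool_calls):
            continuation = generate(context)
            context += continuation
            match = CODE_BLOCK.search(continuation)
            if match is None:
                break  # pure-reasoning step: the model chose not to call the tool
            output = execute_code(match.group(1))
            # Feed the result back so the next step can cross-validate or correct it.
            context += "\n```output\n" + output + "\n```\n"
        return context

In ToRL's RL setup, the final answer extracted from such a trajectory would then be scored (the repository lists math-verify among its dependencies) to produce the training reward; the sketch above only illustrates how generation and execution interleave.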

Quick Start & Requirements

  • Environment Setup: Create the conda environment (sandbox-runtime) with python==3.11 and install dependencies from requirements.txt and runtime/python/requirement.txt. The SandboxFusion tool must be installed and launched separately, with its URL configured in verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py (see the sketch after this list).
  • Training: Execute bash scripts (e.g., scripts/torl_1.5b) to initiate training.
  • Dependencies: wandb, jsonlines, math-verify, hydra-core==1.4.0.dev1, sortedcontainers, qwen-agent[code_interpreter], qwen-agent[python_executor].
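The sketch below shows how the sandbox URL configured in vllm_rollout_spmd.py might be used to execute model-generated code. It is an illustration only: the /run_code endpoint, request payload, and response fields are assumptions about a typical SandboxFusion deployment and should be checked against the SandboxFusion documentation.

    # Hypothetical sandbox client; endpoint path, payload, and response schema
    # are assumptions and may differ from the actual SandboxFusion API.
    import requests

    SANDBOX_URL = "http://localhost:8080"  # placeholder for the URL set in vllm_rollout_spmd.py

    def run_in_sandbox(code: str, timeout: float = 30.0) -> str:
        """Send a Python snippet to the sandbox service and return its textual output."""
        resp = requests.post(
            f"{SANDBOX_URL}/run_code",
            json={"code": code, "language": "python"},
            timeout=timeout,
        )
        resp.raise_for_status()
        payload = resp.json()
        # Fall back to the raw response if the expected fields are absent.
        run_result = payload.get("run_result") or {}
        return run_result.get("stdout") or str(payload)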

Highlighted Details

  • Achieves 43.3% accuracy on AIME2024 with a 7B model, matching larger 32B models.
  • Demonstrates up to 14% higher accuracy compared to baseline models on mathematical benchmarks.
  • Exhibits emergent cognitive behaviors such as self-correction and adaptive strategy selection.
  • Operates directly from base models without imitation learning.

Maintenance & Community

The project acknowledges contributions from DeepSeek R1, Kimi-k1.5, Qwen-Math, VeRL, vLLM, Qwen-Agent, and Sandbox Fusion teams. Further community or roadmap information is not detailed in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project relies on external tools such as SandboxFusion and vLLM, which require separate setup and configuration. The README indicates its components were released on March 28, 2025, so the project is still relatively new.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 17 stars in the last 30 days
