ToRL by GAIR-NLP

Tool-integrated RL for autonomous tool discovery and refinement

Created 5 months ago
288 stars

Top 91.2% on SourcePulse

View on GitHub
Project Summary

ToRL (Tool-Integrated Reinforcement Learning) is a framework for enabling large language models to autonomously discover and refine tool usage strategies through reinforcement learning, targeting researchers and developers working on complex reasoning tasks. It aims to achieve state-of-the-art performance by allowing models to learn when and how to invoke tools, leading to emergent cognitive behaviors like self-correction and adaptive strategy selection.

How It Works

ToRL challenges traditional supervised fine-tuning approaches by employing exploration-driven reinforcement learning for tool integration. Models learn to invoke tools, cross-validate outputs with reasoning, and self-correct errors without explicit human supervision or predefined tool patterns. This approach allows models to adaptively select between tool-based and pure-reasoning strategies, enhancing performance on challenging mathematical benchmarks.
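To make the interleaving concrete, here is a minimal, hypothetical sketch of a tool-integrated rollout loop, assuming the model marks tool calls with fenced python blocks and receives results in output blocks (a common convention); the generate and execute_code callables are placeholders, not part of the ToRL codebase:

    # Hypothetical sketch of a tool-integrated rollout: the model may emit a
    # fenced python block, the block is executed externally, and the execution
    # output is appended to the context so the model can verify the result,
    # self-correct, or continue with pure reasoning.
    import re

    CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

    def tool_integrated_rollout(prompt, generate, execute_code, max_tool_calls=4):
        # generate(context) -> str: LLM continuation (hypothetical interface)
        # execute_code(code) -> str: runs code in a sandbox, returns its output
        context = prompt
        for _ in range(max_tool_calls):
            continuation = generate(context)
            context += continuation
            match = CODE_BLOCK.search(continuation)
            if match is None:
                break  # pure-reasoning step: the model chose not to call the tool
            output = execute_code(match.group(1))
            # Feed the result back so the next step can cross-validate or correct it.
            context += "\n```output\n" + output + "\n```\n"
        return context

In ToRL's RL setup, the final answer extracted from such a trajectory would then be scored (the repository lists math-verify among its dependencies) to produce the training reward; the sketch above only illustrates how generation and execution interleave.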

Quick Start & Requirements

  • Environment Setup: Create the conda environment (sandbox-runtime) with python==3.11 and install dependencies from requirements.txt and runtime/python/requirement.txt. The SandboxFusion tool must be installed and launched separately, with its URL configured in verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py (see the sketch after this list).
  • Training: Execute bash scripts (e.g., scripts/torl_1.5b) to initiate training.
  • Dependencies: wandb, jsonlines, math-verify, hydra-core==1.4.0.dev1, sortedcontainers, qwen-agent[code_interpreter], qwen-agent[python_executor].
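The sketch below shows how the sandbox URL configured in vllm_rollout_spmd.py might be used to execute model-generated code. It is an illustration only: the /run_code endpoint, request payload, and response fields are assumptions about a typical SandboxFusion deployment and should be checked against the SandboxFusion documentation.

    # Hypothetical sandbox client; endpoint path, payload, and response schema
    # are assumptions and may differ from the actual SandboxFusion API.
    import requests

    SANDBOX_URL = "http://localhost:8080"  # placeholder for the URL set in vllm_rollout_spmd.py

    def run_in_sandbox(code: str, timeout: float = 30.0) -> str:
        """Send a Python snippet to the sandbox service and return its textual output."""
        resp = requests.post(
            f"{SANDBOX_URL}/run_code",
            json={"code": code, "language": "python"},
            timeout=timeout,
        )
        resp.raise_for_status()
        payload = resp.json()
        # Fall back to the raw response if the expected fields are absent.
        run_result = payload.get("run_result") or {}
        return run_result.get("stdout") or str(payload)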

Highlighted Details

  • Achieves 43.3% accuracy on AIME2024 with a 7B model, matching larger 32B models.
  • Demonstrates up to 14% higher accuracy compared to baseline models on mathematical benchmarks.
  • Exhibits emergent cognitive behaviors such as self-correction and adaptive strategy selection.
  • Operates directly from base models without imitation learning.

Maintenance & Community

The project acknowledges contributions from DeepSeek R1, Kimi-k1.5, Qwen-Math, VeRL, vLLM, Qwen-Agent, and Sandbox Fusion teams. Further community or roadmap information is not detailed in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project relies on external tools such as SandboxFusion and vLLM, which require separate setup and configuration. The README indicates its components were released on March 28, 2025, so the project is still relatively new.

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 17 stars in the last 30 days
