Agentic RL for LLM tool use
Agentic Reinforced Policy Optimization (ARPO) is an agentic RL algorithm designed for training multi-turn LLM-based agents. It addresses the challenge of aligning step-level tool-use behaviors by encouraging adaptive sampling during high-entropy tool-call rounds, leading to more efficient tool utilization. The target audience includes researchers and developers working on LLM agents and reinforcement learning for complex task execution.
How It Works
ARPO's core innovation lies in its approach to managing high-entropy tool-call rounds. Instead of a fixed sampling strategy, ARPO promotes adaptive branching, allowing the policy model to dynamically adjust its exploration based on the uncertainty introduced by external tool feedback. This method aims to improve the alignment of step-level tool-use behaviors, making the agent more efficient in its interactions.
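To make the idea concrete, the sketch below shows entropy-guided adaptive branching in a rollout collector: extra sibling continuations are sampled only at steps where the model is uncertain after incorporating tool feedback. This is a minimal, illustrative Python sketch, not the repository's implementation; the policy interface (generate_step, call_tool), the entropy threshold, and the branch factor are assumed placeholders.

```python
import math
from statistics import mean

# Illustrative sketch of entropy-guided adaptive branching at tool-call rounds.
# `policy.generate_step` and `policy.call_tool` are assumed helpers, not the
# actual ARPO codebase API; thresholds and branch counts are placeholders.

def step_entropy(token_probs):
    """Mean Shannon entropy over the per-token distributions of one generated step."""
    return mean(-sum(p * math.log(p) for p in dist if p > 0) for dist in token_probs)

def collect_rollouts(policy, prompt, max_turns=8,
                     entropy_threshold=2.0, branch_factor=4):
    """Expand extra branches only after high-entropy tool-call rounds,
    instead of sampling a fixed number of full trajectories."""
    trajectories = [[prompt]]
    for _ in range(max_turns):
        expanded = []
        for traj in trajectories:
            action, token_probs = policy.generate_step(traj)
            # Uncertainty introduced by the previous tool feedback decides
            # how many sibling continuations to sample at this step.
            n_branches = branch_factor if step_entropy(token_probs) > entropy_threshold else 1
            candidates = [action] + [policy.generate_step(traj)[0]
                                     for _ in range(n_branches - 1)]
            for cand in candidates:
                observation = policy.call_tool(cand)  # external tool feedback
                expanded.append(traj + [cand, observation])
        trajectories = expanded
    return trajectories
```

In practice a global rollout budget would cap how many branches are kept, so exploration widens at uncertain tool-call steps without the trajectory count growing unchecked.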
Quick Start & Requirements
The repository provides separate environments for supervised fine-tuning (sft) and RL training (arpo). Install dependencies using pip install -r requirements.txt within each environment.
Highlighted Details
Maintenance & Community
The project is actively maintained, with recent updates in July 2025. It builds upon several other open-source projects like Tool-Star, Llama Factory, and ReCall. Contact is available via email at dongguanting@ruc.edu.cn.
Licensing & Compatibility
Released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The setup process involves multiple steps and requires specific API keys (Bright Data) for certain functionalities. The training scripts are extensive and require careful configuration of paths and parameters. Evaluation requires setting up separate inference environments (vLLM) and running specific evaluation scripts.
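For reference, serving a trained checkpoint with vLLM for evaluation typically looks like the minimal sketch below. The checkpoint path and sampling settings are placeholders, and the repository's own evaluation scripts may drive vLLM differently (for example, through an OpenAI-compatible server).

```python
# Minimal sketch of offline inference with vLLM for evaluation.
# The checkpoint path and sampling settings are placeholders, not values
# taken from the ARPO evaluation scripts.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/arpo_checkpoint")  # placeholder checkpoint path
params = SamplingParams(temperature=0.7, max_tokens=1024)

prompts = ["Answer the question using the available tools: ..."]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```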