ARPO by dongguanting

Agentic RL for LLM tool use

created 3 weeks ago

470 stars

Top 64.5% on SourcePulse

Project Summary

Agentic Reinforced Policy Optimization (ARPO) is an agentic RL algorithm designed for training multi-turn LLM-based agents. It addresses the challenge of aligning step-level tool-use behaviors by encouraging adaptive sampling during high-entropy tool-call rounds, leading to more efficient tool utilization. The target audience includes researchers and developers working on LLM agents and reinforcement learning for complex task execution.

How It Works

ARPO's core innovation lies in how it manages high-entropy tool-call rounds. The motivating observation is that a model's token-level entropy spikes in the steps immediately after external tool feedback is injected into the context, which is precisely where step-level behavior is least settled. Instead of a fixed sampling strategy, ARPO branches additional partial rollouts at those uncertain steps, letting the policy model dynamically adjust its exploration to the uncertainty introduced by tool feedback. Concentrating exploration there improves the alignment of step-level tool-use behaviors and makes the agent more efficient in its interactions.
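
The control flow is easier to see in code. Below is a minimal, self-contained Python sketch of entropy-gated branching; everything in it (the entropy threshold, the branch count, the fake_policy_step stand-in, and the every-other-turn tool schedule) is an illustrative assumption, not the repository's actual API.

    import math
    import random

    ENTROPY_THRESHOLD = 1.5  # assumed value; the real threshold is a tuned hyperparameter
    MAX_BRANCHES = 3         # extra partial rollouts at a high-entropy step (assumed)
    MAX_TURNS = 4

    def entropy(probs):
        # Shannon entropy (nats) of a next-token distribution.
        return -sum(p * math.log(p) for p in probs if p > 0)

    def fake_policy_step(context):
        # Stand-in for one decoding step: returns the next-token distribution.
        # We mimic the observation that tool feedback flattens the distribution.
        vocab = 8 if context.endswith("</tool_result>") else 3
        weights = [random.random() for _ in range(vocab)]
        total = sum(weights)
        return [w / total for w in weights]

    def rollout(context, turn=0):
        # Depth-first rollout: branch extra partial trajectories whenever the
        # step right after tool feedback is high-entropy; otherwise sample once.
        if turn >= MAX_TURNS:
            return [context]
        if turn % 2 == 0:                      # pretend every other turn runs a tool
            context += "</tool_result>"
        probs = fake_policy_step(context)
        branches = MAX_BRANCHES if entropy(probs) > ENTROPY_THRESHOLD else 1
        trajectories = []
        for b in range(branches):
            trajectories += rollout(context + f"<turn {turn}.{b}>", turn + 1)
        return trajectories

    random.seed(0)
    print("collected", len(rollout("<prompt>")), "trajectories")  # more where entropy spiked

A GRPO-style baseline would instead sample a fixed number of complete trajectories from the prompt; branching only at uncertain tool-call steps is what lets ARPO reach comparable accuracy with fewer tool calls.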

Quick Start & Requirements

  • Installation: Clone the repository and set up separate Conda environments for SFT (sft) and RL training (arpo). Install dependencies using pip install -r requirements.txt within each environment.
  • Prerequisites: Python 3.10+, PyTorch 2.4.0 with CUDA 12.4, Flash Attention, and Bright Data API keys for the search tool.
  • Setup: Requires downloading datasets and configuring API keys and paths in YAML and shell scripts. Training involves multiple stages: optional cold-start SFT, ARPO RL training, and evaluation setup.
  • Links: Paper, Hugging Face Models (see the loading sketch below)
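
Since checkpoints are published on Hugging Face, a trained model can be loaded with the standard transformers API. A minimal sketch follows; the repo id below is a hypothetical placeholder, so substitute the actual checkpoint name from the project's Hugging Face page.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical repo id -- replace with the actual ARPO checkpoint name.
    model_id = "dongguanting/ARPO-Qwen3-14B"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    messages = [{"role": "user", "content": "Who won the 2022 World Cup?"}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=128)
    print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))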

Highlighted Details

  • Achieves 61.2% Pass@5 on GAIA and 24.0% on HLE with Qwen3-14B, using half the tool calls compared to GRPO.
  • Supports multi-tool agentic RL training for Qwen2.5, Qwen3, and Llama3 models.
  • Implements extensive tool-call acceleration and memory optimization.
  • Includes scripts for SFT, RL training, and evaluation, with model checkpoints available.

Maintenance & Community

The project is actively maintained, with recent updates in July 2025. It builds on several other open-source projects, including Tool-Star, LLaMA-Factory, and ReCall. Contact is available via email at dongguanting@ruc.edu.cn.

Licensing & Compatibility

Released under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Setup is multi-step: API keys (Bright Data) must be provisioned for the search tool, and the extensive training scripts require careful configuration of paths and parameters. Evaluation additionally requires a separate vLLM inference environment and dedicated evaluation scripts.
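
For the evaluation stage, the simplest way to sanity-check a checkpoint outside the repo's own scripts is vLLM's offline Python API, sketched below. The model path is a placeholder; the project's evaluation scripts configure vLLM with their own parameters.

    from vllm import LLM, SamplingParams

    # Placeholder path -- point this at the ARPO-trained checkpoint to evaluate.
    llm = LLM(model="Qwen/Qwen3-14B")

    params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
    outputs = llm.generate(["Briefly: what does ARPO optimize?"], params)
    print(outputs[0].outputs[0].text)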

Health Check

  • Last commit: 1 day ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 18
  • Star history: 471 stars in the last 24 days
