Agentic RL for LLM tool use
Agentic Reinforced Policy Optimization (ARPO) is an agentic RL algorithm designed for training multi-turn LLM-based agents. It addresses the challenge of aligning step-level tool-use behaviors by encouraging adaptive sampling during high-entropy tool-call rounds, leading to more efficient tool utilization. The target audience includes researchers and developers working on LLM agents and reinforcement learning for complex task execution.
How It Works
ARPO's core innovation lies in its approach to managing high-entropy tool-call rounds. Instead of a fixed sampling strategy, ARPO promotes adaptive branching, allowing the policy model to dynamically adjust its exploration based on the uncertainty introduced by external tool feedback. This method aims to improve the alignment of step-level tool-use behaviors, making the agent more efficient in its interactions.
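To make the idea concrete, the sketch below shows entropy-guided adaptive branching in a rollout collector: extra sibling continuations are sampled only at steps where the model is uncertain after incorporating tool feedback. This is a minimal, illustrative Python sketch, not the repository's implementation; the policy interface (generate_step, call_tool), the entropy threshold, and the branch factor are assumed placeholders.

```python
import math
from statistics import mean

# Illustrative sketch of entropy-guided adaptive branching at tool-call rounds.
# `policy.generate_step` and `policy.call_tool` are assumed helpers, not the
# actual ARPO codebase API; thresholds and branch counts are placeholders.

def step_entropy(token_probs):
    """Mean Shannon entropy over the per-token distributions of one generated step."""
    return mean(-sum(p * math.log(p) for p in dist if p > 0) for dist in token_probs)

def collect_rollouts(policy, prompt, max_turns=8,
                     entropy_threshold=2.0, branch_factor=4):
    """Expand extra branches only after high-entropy tool-call rounds,
    instead of sampling a fixed number of full trajectories."""
    trajectories = [[prompt]]
    for _ in range(max_turns):
        expanded = []
        for traj in trajectories:
            action, token_probs = policy.generate_step(traj)
            # Uncertainty introduced by the previous tool feedback decides
            # how many sibling continuations to sample at this step.
            n_branches = branch_factor if step_entropy(token_probs) > entropy_threshold else 1
            candidates = [action] + [policy.generate_step(traj)[0]
                                     for _ in range(n_branches - 1)]
            for cand in candidates:
                observation = policy.call_tool(cand)  # external tool feedback
                expanded.append(traj + [cand, observation])
        trajectories = expanded
    return trajectories
```

In practice a global rollout budget would cap how many branches are kept, so exploration widens at uncertain tool-call steps without the trajectory count growing unchecked.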
Quick Start & Requirements
The repository provides separate environments for supervised fine-tuning (sft) and RL training (arpo). Install dependencies using pip install -r requirements.txt within each environment.
Highlighted Details
Maintenance & Community
The project is actively maintained, with recent updates in July 2025. It builds upon several other open-source projects like Tool-Star, Llama Factory, and ReCall. Contact is available via email at dongguanting@ruc.edu.cn.
Licensing & Compatibility
Released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The setup process involves multiple steps and requires specific API keys (Bright Data) for certain functionalities. The training scripts are extensive and require careful configuration of paths and parameters. Evaluation requires setting up separate inference environments (vLLM) and running specific evaluation scripts.
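For reference, serving a trained checkpoint with vLLM for evaluation typically looks like the minimal sketch below. The checkpoint path and sampling settings are placeholders, and the repository's own evaluation scripts may drive vLLM differently (for example, through an OpenAI-compatible server).

```python
# Minimal sketch of offline inference with vLLM for evaluation.
# The checkpoint path and sampling settings are placeholders, not values
# taken from the ARPO evaluation scripts.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/arpo_checkpoint")  # placeholder checkpoint path
params = SamplingParams(temperature=0.7, max_tokens=1024)

prompts = ["Answer the question using the available tools: ..."]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```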