verl-agent by langfengQ

RL for LLM/VLM agent training

created 4 months ago
668 stars

Top 51.4% on sourcepulse

Project Summary

verl-agent is a Python framework for training LLM/VLM agents using reinforcement learning, specifically designed for long-horizon, multi-turn interactions. It addresses the scalability limitations of previous methods by processing each interaction step independently with customizable input structures, enabling efficient training for complex tasks. The framework supports a wide range of LLMs, RL algorithms, and interactive environments, making it suitable for researchers and developers working on advanced agent capabilities.

How It Works

verl-agent employs a step-wise interaction design, allowing fully customizable input structures at each turn. This contrasts with prior approaches that concatenate the full interaction history into the prompt, which becomes computationally prohibitive for long-horizon tasks. Because the per-step input length stays bounded rather than growing with the history, verl-agent scales to long horizons. It also introduces "group environments," in which multiple environment instances share an identical initial state; this benefits algorithms like GRPO and DAPO that require repeated rollouts from the same state.
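The group-environment idea can be pictured with a short sketch. The names below (ToyEnv, make_group) are illustrative only and are not verl-agent's actual API; the point is that several environment copies are reset from the same seed so group-based algorithms can compare rollouts that start from identical states.

```python
import random
from typing import List

class ToyEnv:
    """Stand-in for an interactive environment (ALFWorld, WebShop, ...)."""
    def reset(self, seed: int) -> int:
        self.rng = random.Random(seed)
        self.state = self.rng.randint(0, 100)  # same seed -> same initial state
        return self.state

    def step(self, action: int):
        self.state += action
        reward = float(self.state % 7 == 0)
        done = self.state > 120
        return self.state, reward, done

def make_group(group_size: int, seed: int) -> List[ToyEnv]:
    """Create `group_size` env copies that all start from the same initial
    state, so GRPO/DAPO-style algorithms can roll out repeatedly from it."""
    envs = [ToyEnv() for _ in range(group_size)]
    for env in envs:
        env.reset(seed=seed)  # identical seed => identical initial state
    return envs

group = make_group(group_size=4, seed=123)
assert len({env.state for env in group}) == 1  # all rollouts share one start state
```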

Quick Start & Requirements

  • Installation: Requires Python 3.12, PyTorch 2.6.0 (with CUDA 12.4), flash-attn 2.7.4.post1, and vllm 0.8.5. Installation involves creating a conda environment and running pip install -e . after setting up PyTorch (a quick version sanity check follows this list).
  • Environments: Each environment (ALFWorld, WebShop, Sokoban, Gym Cards, AppWorld) has specific installation instructions and dependencies, often requiring separate conda environments to avoid conflicts. ALFWorld requires gymnasium 0.29.1 and stable-baselines3 2.6.0. WebShop requires Python <=3.10 and manual downloads.
  • Resources: Training 7B models with LoRA is supported on 2 H100 GPUs.
  • Links: Official Docs
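
The pinned versions above can be sanity-checked from inside the environment. This is a hedged convenience sketch, not part of verl-agent; it only reads installed package metadata and compares it against the versions listed above (the flash-attn distribution name may differ depending on how it was installed).

```python
# Convenience check for the pinned dependencies listed above (illustrative only).
import sys
from importlib.metadata import version, PackageNotFoundError

import torch

EXPECTED = {
    "torch": "2.6.0",
    "flash-attn": "2.7.4.post1",
    "vllm": "0.8.5",
}

print(f"Python : {sys.version.split()[0]} (expected 3.12.x)")
print(f"CUDA   : {torch.version.cuda} (expected 12.4)")
for pkg, want in EXPECTED.items():
    try:
        got = version(pkg)
    except PackageNotFoundError:
        got = "not installed"
    flag = "OK" if got.startswith(want) else "MISMATCH"
    print(f"{pkg:10s}: {got} (expected {want}) -> {flag}")
```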

Highlighted Details

  • Supports multi-turn, step-wise interaction with customizable per-step input structures for scalability.
  • Implements the novel GiGPO algorithm for efficient credit assignment in long-horizon LLM agent training (a simplified sketch of the group-relative idea follows this list).
  • Offers support for text-only and vision-language agents, including models like Qwen3, Qwen2.5-VL, and LLaMA3.1.
  • Includes a diverse suite of RL algorithms (GiGPO, GRPO, PPO, DAPO, RLOO, REINFORCE++) and environments (ALFWorld, WebShop, Sokoban, Gym Cards, AppWorld).
  • Supports LoRA fine-tuning for reduced computational costs.
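
The sketch below illustrates only the group-relative normalization shared by GRPO-style methods; it is not verl-agent's GiGPO implementation, and the function name and numbers are made up for illustration. Rollouts that start from the same initial state are normalized against each other instead of against a learned value function.

```python
# Simplified, illustrative sketch of group-relative advantages (GRPO-style).
from statistics import mean, stdev
from typing import List

def group_relative_advantages(returns: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize episode returns within one group of rollouts that started
    from the same initial state: A_i = (R_i - mean(R)) / (std(R) + eps)."""
    mu = mean(returns)
    sigma = stdev(returns) if len(returns) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in returns]

# Four rollouts from one "group environment" (identical initial state).
episode_returns = [1.0, 0.0, 0.5, 1.0]
print(group_relative_advantages(episode_returns))

# GiGPO, as described in the paper, additionally forms step-level groups
# (steps that share the same environment state across rollouts) for
# finer-grained credit assignment; that part is omitted here.
```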

Maintenance & Community

The project is associated with the paper "Group-in-Group Policy Optimization for LLM Agent Training." Recent updates include support for Qwen3, LoRA, REINFORCE++, and RLOO. The project acknowledges the veRL team and RAGEN project for foundational infrastructure and inspiration.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The AppWorld environment is experimental. To reproduce paper results for GiGPO, users must use a version released prior to the "[2025.06.03] Major Update." There are noted dependency incompatibilities (e.g., typer) for WebShop installation, though these are flagged as ignorable.

Health Check

  • Last commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 9
  • Issues (30d): 22
  • Star History: 676 stars in the last 90 days
