WebRL by THUDM

Framework for training LLM web agents using reinforcement learning

created 9 months ago
430 stars

Project Summary

WebRL provides a framework for training Large Language Model (LLM) web agents using a self-evolving online curriculum reinforcement learning technique. It targets the WebArena environment, enabling agents to learn complex web navigation and task completion skills. The project is suitable for researchers and developers focused on autonomous agents and LLM-powered web interaction.

How It Works

WebRL employs a reinforcement learning approach where the agent learns to interact with web environments. A key innovation is the self-evolving online curriculum, which dynamically adjusts the learning tasks to optimize agent progress. This curriculum generation, coupled with an Outcome-supervised Reward Model (ORM), allows for efficient and effective training of web agents.
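The loop below is a minimal, hypothetical sketch of that idea, not WebRL's actual implementation (all function names, the skill/difficulty scoring rule, and the update step are illustrative): failed tasks seed the next round's curriculum, an ORM-style scorer labels rollout outcomes, and only approved trajectories feed the policy update.

```python
import random

def orm_score(trajectory):
    """Stand-in for the Outcome-supervised Reward Model (ORM): a binary
    outcome reward of 1.0 for a completed task, 0.0 otherwise."""
    return 1.0 if trajectory["completed"] else 0.0

def rollout(policy_skill, task):
    """Toy web-agent rollout: success is more likely when the policy's
    skill exceeds the task's difficulty."""
    p_success = policy_skill / (policy_skill + task["difficulty"])
    return {"task": task, "completed": random.random() < p_success}

def evolve_curriculum(failed_tasks, n_new):
    """Self-evolving step: propose new tasks by perturbing recent failures,
    keeping difficulty near the agent's current frontier."""
    return [
        {"difficulty": max(0.1, random.choice(failed_tasks)["difficulty"]
                           * random.uniform(0.8, 1.2))}
        for _ in range(n_new)
    ]

def train(n_rounds=5, batch=32):
    random.seed(0)
    skill = 1.0                                   # proxy for policy parameters
    tasks = [{"difficulty": random.uniform(0.5, 2.0)} for _ in range(batch)]
    replay = []
    for _ in range(n_rounds):
        trajectories = [rollout(skill, t) for t in tasks]
        # Keep only ORM-approved trajectories for the policy update.
        kept = [tr for tr in trajectories if orm_score(tr) > 0.5]
        replay.extend(kept)
        skill += 0.05 * len(kept) / batch         # stand-in for a gradient step
        failed = [tr["task"] for tr in trajectories if not tr["completed"]]
        tasks = evolve_curriculum(failed or tasks, batch)
    return skill, len(replay)
```

The point of the sketch is the control flow: curriculum generation is driven by the agent's own failures, so task difficulty tracks the agent's ability as training progresses.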

Quick Start & Requirements

  • Install: Create a conda environment (python==3.10), activate it, cd into the WebRL directory, and run pip install -e .
  • Dependencies: Python 3.10, Conda.
  • Models: Requires downloading pre-trained checkpoints for Actor (e.g., WebRL-GLM-4-9B, WebRL-LLaMA-3.1-8B) and ORM (e.g., ORM-Llama-3.1-8B).
  • Environment: Interaction with WebArena is facilitated via VAB-WebArena-Lite.
  • Docs: 📃 Paper

Highlighted Details

  • Supports multiple LLM backbones including GLM and LLaMA variants.
  • Includes scripts for SFT baseline training using LLaMA-Factory.
  • Provides detailed procedures for generating new instructions and processing interaction data with reward labeling.
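As a rough illustration of the reward-labeling step (the function and field names here are hypothetical, not WebRL's actual pipeline), each interaction trajectory receives an outcome reward from the reward model, and trajectories are split by that label before training:

```python
def toy_orm(trajectory):
    """Hypothetical stand-in for the ORM: judges success from the final state."""
    return 1.0 if trajectory.get("final_state") == trajectory.get("goal") else 0.0

def label_trajectories(trajectories, reward_model, threshold=0.5):
    """Attach an outcome reward to each trajectory and split positives
    (candidates for policy training) from negatives."""
    positives, negatives = [], []
    for traj in trajectories:
        traj["reward"] = reward_model(traj)
        (positives if traj["reward"] >= threshold else negatives).append(traj)
    return positives, negatives
```

In the actual project the judge is a trained LLM checkpoint (e.g., ORM-Llama-3.1-8B) rather than a rule; the sketch only shows the label-then-filter shape such processing plausibly takes.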

Maintenance & Community

  • The project is associated with THUDM.
  • Links to model checkpoints are provided on Hugging Face and ModelScope.

Licensing & Compatibility

  • The README does not explicitly state a license, and compatibility for commercial use is therefore unspecified. The accompanying paper is an arXiv preprint.

Limitations & Caveats

The project is presented as a research preprint, and its stability, long-term maintenance, and production readiness are not yet established. Specific hardware requirements for training and inference are not detailed.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 61 stars in the last 90 days
