WorkArena by ServiceNow

Benchmark for evaluating web agents on knowledge work

Created 2 years ago

258 stars

Top 98.0% on SourcePulse

Project Summary

WorkArena is a benchmark suite designed to evaluate the capabilities of web agents in performing common knowledge work tasks within the ServiceNow platform. It targets AI researchers and developers building agents for enterprise automation, offering a standardized, browser-based environment to assess agent performance on realistic workflows and accelerate the development of effective solutions for knowledge workers.

How It Works

The benchmark uses the ServiceNow platform to construct a diverse set of browser-based tasks. WorkArena-L1 features 33 atomic tasks covering core ServiceNow UI components, totaling 19,912 instances. WorkArena++ composes these atomic tasks into more complex, realistic scenarios that test agents' planning, reasoning, and memory. Evaluations are typically run with the AgentLab framework, which integrates with BrowserGym to run experiments in parallel and report results on a unified leaderboard.
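The relationship between atomic and composed tasks can be pictured with a small, self-contained sketch. Everything here is hypothetical and for illustration only: the `AtomicTask` and `ComposedTask` classes, the task names, and the dictionary-based state are not WorkArena's actual classes or API, and the rule that a composed task passes only when every step passes is an assumption rather than the benchmark's documented scoring.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for the state of a ServiceNow instance.
State = dict


@dataclass
class AtomicTask:
    """One self-contained UI task (illustrative, not WorkArena's class)."""
    name: str
    validate: Callable[[State], bool]  # returns True when the goal is met


@dataclass
class ComposedTask:
    """A scenario built from several atomic tasks (illustrative)."""
    name: str
    steps: list[AtomicTask]

    def validate(self, state: State) -> bool:
        # Assumption: the scenario counts as solved only if every step is.
        return all(step.validate(state) for step in self.steps)


# Hypothetical atomic tasks and a scenario that chains them.
fill_form = AtomicTask("fill_form", lambda s: s.get("form_submitted", False))
order_item = AtomicTask("order_item", lambda s: s.get("item_ordered", False))
onboard_user = ComposedTask("onboard_user", [fill_form, order_item])

print(onboard_user.validate({"form_submitted": True}))                        # False
print(onboard_user.validate({"form_submitted": True, "item_ordered": True}))  # True
```

Under this assumed all-steps rule, an agent that completes only part of a scenario gets no credit, which is one way a composed task can be harder than the sum of its atomic parts.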

Quick Start & Requirements

Installation: Install the Python package via pip install browsergym-workarena, followed by playwright install.
Prerequisites: Requires access to ServiceNow instances, obtained by submitting a form and accepting terms on Hugging Face (https://huggingface.co/datasets/ServiceNow/WorkArena-Instances). Hugging Face authentication (e.g., huggingface-cli login) is necessary.
Links:
- Benchmark Instances: https://huggingface.co/datasets/ServiceNow/WorkArena-Instances
- Live Demo (Video): https://github.com/ServiceNow/WorkArena/assets/2374980/68640f09-7d6f-4eb1-b556-c294a6afef70
- WorkArena Paper: https://arxiv.org/abs/2403.07718
- WorkArena++ Paper: https://arxiv.org/abs/2407.05291
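Before launching an evaluation, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch using only the Python standard library. It makes several assumptions: that the pip package exposes an importable module named `browsergym.workarena`, that Playwright's Python module is named `playwright`, and that a Hugging Face token is supplied through the `HF_TOKEN` environment variable. A token saved by `huggingface-cli login` is stored on disk instead and would not be detected by this check. The `module_available` and `preflight` helpers are hypothetical names, not part of WorkArena.

```python
import importlib.util
import os


def module_available(name: str) -> bool:
    """Return True if `name` looks importable, without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package (e.g. `browsergym`) is missing.
        return False


def preflight() -> dict[str, bool]:
    """Collect simple yes/no checks for the setup steps listed above."""
    return {
        # Assumed module name for the `browsergym-workarena` pip package.
        "browsergym.workarena": module_available("browsergym.workarena"),
        "playwright": module_available("playwright"),
        # Only checks the environment variable, not a token saved on disk.
        "HF_TOKEN": bool(os.environ.get("HF_TOKEN")),
    }


for check, ok in preflight().items():
    print(f"{check}: {'ok' if ok else 'missing'}")
```

A "missing" result is a hint rather than a verdict: for example, `HF_TOKEN` may be unset while a valid login token exists on disk.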

Highlighted Details

WorkArena-L1 comprises 19,912 unique instances across 33 atomic tasks.
WorkArena++ includes 682 composed tasks evaluating planning, reasoning, and memory.
Task categories span Knowledge Bases, Forms, Service Catalogs, Lists, Menus, and Dashboards.
Designed for evaluation via AgentLab and BrowserGym, facilitating standardized benchmarking.

Maintenance & Community

Community engagement is encouraged via a Discord server.
The project is part of the broader BrowserGym ecosystem and integrates with AgentLab.
Associated research published in ICML 2024 and NeurIPS 2024.

Licensing & Compatibility

The project's README does not explicitly state a software license. This omission requires clarification for adoption decisions, particularly regarding commercial use or derivative works.

Limitations & Caveats

The benchmark is explicitly described as "not solved": current AI agents do not yet complete all of its tasks reliably.

Health Check

Last Commit: 2 months ago
Responsiveness: Inactive
Pull Requests (30d): 1
Issues (30d): 2

Star History

7 stars in the last 30 days
