WorkArena by ServiceNow

Benchmark for evaluating web agents on knowledge work

Created 2 years ago
251 stars

Top 99.8% on SourcePulse

Project Summary

WorkArena is a benchmark suite designed to evaluate the capabilities of web agents in performing common knowledge work tasks within the ServiceNow platform. It targets AI researchers and developers building agents for enterprise automation, offering a standardized, browser-based environment to assess agent performance on realistic workflows and accelerate the development of effective solutions for knowledge workers.

How It Works

The benchmark uses the ServiceNow platform to construct a diverse set of browser-based tasks. WorkArena-L1 features 33 atomic tasks covering core ServiceNow UI components, totaling 19,912 unique instances. WorkArena++ composes these atomic tasks into more complex, realistic scenarios that test agents' planning, reasoning, and memory abilities. Evaluations are typically run with the AgentLab framework, which integrates with BrowserGym to run experiments in parallel and report results on a unified leaderboard.

Quick Start & Requirements

Highlighted Details

  • WorkArena-L1 comprises 19,912 unique instances across 33 atomic tasks.
  • WorkArena++ includes 682 composed tasks evaluating planning, reasoning, and memory.
  • Task categories span Knowledge Bases, Forms, Service Catalogs, Lists, Menus, and Dashboards.
  • Designed for evaluation via AgentLab and BrowserGym, facilitating standardized benchmarking.
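
As a rough illustration of the per-category reporting that a standardized benchmark enables, the following Python sketch aggregates success rates by task category. It is not AgentLab or BrowserGym code: the `EpisodeResult` record, the `success_rate_by_category` helper, the task names, and the sample outcomes are all hypothetical and made up for illustration. Only the category names come from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeResult:
    """One agent attempt at one task instance (hypothetical record type)."""
    category: str   # e.g. "Forms", "Lists", "Menus"
    task_name: str  # placeholder name, not a real WorkArena task ID
    success: bool   # whether the agent completed the task


def success_rate_by_category(results):
    """Return {category: fraction of successful episodes} for the given results."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for result in results:
        totals[result.category] += 1
        successes[result.category] += int(result.success)
    return {category: successes[category] / totals[category] for category in totals}


# Made-up outcomes, purely to show the shape of the computation.
results = [
    EpisodeResult("Forms", "hypothetical-form-task", True),
    EpisodeResult("Forms", "hypothetical-form-task", False),
    EpisodeResult("Lists", "hypothetical-list-task", True),
    EpisodeResult("Menus", "hypothetical-menu-task", False),
]

rates = success_rate_by_category(results)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%}")
```

Reporting per category rather than as a single overall number makes it easier to see which kinds of UI interaction an agent handles well and which it does not.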

Maintenance & Community

  • Community engagement is encouraged via a Discord server.
  • The project is part of the broader BrowserGym ecosystem and integrates with AgentLab.
  • Associated research was published at ICML 2024 and NeurIPS 2024.

Licensing & Compatibility

  • The project's README does not explicitly state a software license. Confirm the licensing terms before adopting the project, particularly for commercial use or derivative works.

Limitations & Caveats

  • The benchmark is explicitly described as "not solved": current AI agents do not yet perform reliably across all tasks.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 6 stars in the last 30 days
