ClawBench by TIGER-AI-Lab

AI browser agent benchmark for real-world web tasks

Created 3 months ago

467 stars

Top 64.3% on SourcePulse

Summary

ClawBench benchmarks AI browser agents on everyday online tasks performed on live websites. To make evaluation realistic, it assesses agents on 153 tasks (V1) and 130 tasks (V2) across 144 sites, showing where current agents succeed and where they fall short.

How It Works

Each run is captured by a 5-layer recording pipeline (video, network, actions, screenshots, messages) inside an isolated Docker container while the agent works on live consumer websites. Evaluation combines DOM matching with an LLM judge to verify end-to-end task success against human references. The recorded layers give a detailed account of each run, which supports reproducible assessment of agents navigating dynamic web environments.
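The pairing of DOM matching with an LLM judge can be pictured with a minimal sketch. Everything in it is hypothetical rather than ClawBench's actual API: the names `dom_matches` and `evaluate_run`, the dictionary representation of DOM state, and the `judge` callable are illustrative. The rule that both checks must pass is also an assumption, since the summary does not say how the two signals are combined.

```python
from typing import Callable, Mapping


def _normalise(text: str) -> str:
    """Collapse whitespace and lowercase so cosmetic differences do not matter."""
    return " ".join(text.split()).lower()


def dom_matches(observed: Mapping[str, str], reference: Mapping[str, str]) -> bool:
    """Return True when every reference field is present in the observed
    DOM state with the same normalised value (hypothetical matching rule)."""
    return all(
        key in observed and _normalise(observed[key]) == _normalise(value)
        for key, value in reference.items()
    )


def evaluate_run(
    observed: Mapping[str, str],
    reference: Mapping[str, str],
    transcript: str,
    judge: Callable[[str], bool],
) -> bool:
    """Assumed combination rule: a run succeeds only if the DOM check
    passes AND the judge accepts the transcript."""
    return dom_matches(observed, reference) and judge(transcript)


# Stand-in for an LLM judge: a real one would call a model.
def stub_judge(transcript: str) -> bool:
    return "order confirmed" in transcript.lower()


reference = {"item": "Blue Mug", "quantity": "2"}
observed = {"item": "  blue mug ", "quantity": "2", "coupon": ""}
result = evaluate_run(observed, reference, "Order confirmed for 2 mugs", stub_judge)
```

Keeping the judge as a plain callable means the deterministic DOM check can be tested without any model access.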

Quick Start & Requirements

Install from PyPI with pip install clawbench-eval. For source development, clone the repo and run ./run.sh. Prerequisites are Python 3.11+, uv, and a container engine (Docker or Podman). Key resources include the Project Page, Leaderboard, and Paper.
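Before installing, the listed prerequisites can be verified with a short script. This is a sketch, not part of ClawBench: the function name `check_prerequisites` is hypothetical, and it assumes the tools are installed under their usual executable names (`uv`, `docker`, `podman`) and are reachable on PATH.

```python
import shutil
import sys


def check_prerequisites(version_info=sys.version_info, which=shutil.which) -> list[str]:
    """Return a list of human-readable problems; an empty list means all
    prerequisites named in this section were found."""
    problems = []
    # Python 3.11 or newer is required.
    if tuple(version_info[:2]) < (3, 11):
        problems.append("Python 3.11+ is required")
    # uv must be on PATH (assumed executable name: "uv").
    if which("uv") is None:
        problems.append("uv was not found on PATH")
    # Either container engine is acceptable.
    if which("docker") is None and which("podman") is None:
        problems.append("no container engine (docker or podman) found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
```

The `version_info` and `which` parameters exist only so the check can be exercised without changing the real environment.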

Highlighted Details

Scope: 153 (V1) / 130 (V2) tasks across 144 live consumer websites.
Methodology: 5-layer recording pipeline and LLM judge for end-to-end evaluation.
Performance: Top agent success rate is 33.3%, indicating significant room for improvement.
Data: Task definitions and full 5-layer execution traces available on Hugging Face.
Extensibility: Supports multiple agent harnesses for diverse integrations.
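To make the five recorded layers concrete, the sketch below models one run's trace and reports which layers are absent. The class `RunTrace`, its fields, and the idea of storing each layer as a list of artifact names are assumptions made for illustration. The actual trace format published on Hugging Face may differ.

```python
from dataclasses import dataclass, field

# The five layers named in the methodology.
LAYERS = ("video", "network", "actions", "screenshots", "messages")


@dataclass
class RunTrace:
    """Hypothetical container for one task run's recorded artifacts."""
    task_id: str
    layers: dict[str, list[str]] = field(default_factory=dict)

    def missing_layers(self) -> list[str]:
        """Names of layers that have no recorded artifacts, in LAYERS order."""
        return [name for name in LAYERS if not self.layers.get(name)]


trace = RunTrace(
    task_id="example-task",  # made-up identifier
    layers={"video": ["run.mp4"], "actions": ["click", "type"]},
)
incomplete = trace.missing_layers()
```

A completeness check like this is one way to reject a run before scoring if any layer failed to record.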

Maintenance & Community

Actively maintained, with recent updates in May 2026. Community activity runs through Discord and GitHub Discussions, and contributions are encouraged. Supported by NAIL Group and TIGER-Lab.

Licensing & Compatibility

Released under the permissive Apache 2.0 license, enabling broad adoption in open-source and commercial projects.

Limitations & Caveats

Because tasks run on live websites, results can shift when a site changes. The current top score of 33.3% shows that most tasks remain challenging even for the best agent. Failures can stem from CAPTCHAs, bot checks, other site defenses, or model limitations.
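A success rate and a breakdown of failure causes can be computed from per-task outcomes as sketched below. The outcome labels and the sample data are invented for illustration and are not ClawBench results. The function names are hypothetical as well.

```python
from collections import Counter


def success_rate(outcomes: list[str]) -> float:
    """Percentage of runs labelled "success"."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    successes = sum(1 for outcome in outcomes if outcome == "success")
    return 100.0 * successes / len(outcomes)


def failure_breakdown(outcomes: list[str]) -> Counter:
    """Count each non-success label, e.g. CAPTCHA blocks vs. model errors."""
    return Counter(outcome for outcome in outcomes if outcome != "success")


# Made-up outcomes for six runs (labels are assumptions, not a ClawBench schema).
sample = ["success", "captcha", "model_error", "success", "bot_check", "model_error"]
rate = round(success_rate(sample), 1)
causes = failure_breakdown(sample)
```

Separating site-side failures (CAPTCHAs, bot checks) from model-side ones helps show how much of a low score reflects the agent itself.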

Explore Similar Projects

GitTaskBench by QuantaAlpha

WebCanvas by iMeanAI

agent-skills-eval by darkrishabh

WorkArena by ServiceNow

agents-last-exam by rdi-berkeley

WildClawBench by InternLM

terminal-bench-2 by harbor-framework

openclaw-mission-control by robsannaa

TheAgentCompany by TheAgentCompany

webarena by web-arena-x

terminal-bench-3 by harbor-framework

OSWorld by xlang-ai