ClawBench by TIGER-AI-Lab

AI browser agent benchmark for real-world web tasks

Created 1 month ago
336 stars

Top 81.8% on SourcePulse

Summary

ClawBench benchmarks AI browser agents on everyday online tasks performed on live websites. It comprises 153 tasks in V1 and 130 tasks in V2, spanning 144 sites, to give a realistic picture of what current agents can and cannot do.

How It Works

Each agent run executes in an isolated Docker container against live consumer websites and is captured by a 5-layer recording pipeline (video, network, actions, screenshots, messages). Evaluation combines DOM matching with an LLM judge to verify end-to-end task success against human references. The recorded traces are intended to make the assessment comprehensive and reproducible even though the target sites are dynamic.
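
As a rough illustration of the evaluation step described above, the sketch below combines a DOM-match result with an LLM-judge verdict into one pass/fail decision and checks a trace for the five recording layers. The names Verdict, missing_layers, and task_succeeded are hypothetical, as is the dictionary-shaped trace. Requiring both checks to pass is an assumption; the summary only says the two signals are combined, not how.

```python
from dataclasses import dataclass

# The five recording layers named in the summary.
RECORDING_LAYERS = ("video", "network", "actions", "screenshots", "messages")


@dataclass(frozen=True)
class Verdict:
    """Hypothetical container for the two evaluation signals."""
    dom_match: bool     # did the DOM check match the human reference?
    judge_passed: bool  # did the LLM judge accept the run?


def missing_layers(trace: dict) -> list[str]:
    """Return the recording layers absent from a trace (hypothetical schema)."""
    return [layer for layer in RECORDING_LAYERS if layer not in trace]


def task_succeeded(verdict: Verdict) -> bool:
    # Assumption: a task counts as solved only when both checks pass.
    return verdict.dom_match and verdict.judge_passed
```

A stricter or looser combination rule (for example, letting the judge override a DOM mismatch) would change only task_succeeded; the real benchmark may use a different rule entirely.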

Quick Start & Requirements

Install from PyPI with pip install clawbench-eval. To develop from source, clone the repository and run ./run.sh. Prerequisites are Python 3.11+, uv, and a container engine (Docker or Podman). Key resources include the Project Page, Leaderboard, and Paper.
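
Before installing, it can help to confirm the listed prerequisites are present. The following is a minimal sketch that uses only the Python standard library; the function name missing_prerequisites is hypothetical, and the executable names uv, docker, and podman are assumed to be how those tools appear on PATH.

```python
import shutil
import sys


def missing_prerequisites(version=sys.version_info, which=shutil.which):
    """Return the prerequisites from the summary that are not satisfied."""
    missing = []
    # The summary asks for Python 3.11 or newer.
    if tuple(version[:2]) < (3, 11):
        missing.append("Python 3.11+")
    # Assumption: uv is installed as an executable named "uv".
    if which("uv") is None:
        missing.append("uv")
    # Either container engine is enough.
    if which("docker") is None and which("podman") is None:
        missing.append("container engine (Docker or Podman)")
    return missing
```

Calling missing_prerequisites() with no arguments inspects the current interpreter and PATH; an empty list means all three requirements were found. The version and which parameters exist only so the check can be exercised without touching the real environment.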

Highlighted Details

  • Scope: 153 (V1) / 130 (V2) tasks across 144 live consumer websites.
  • Methodology: 5-layer recording pipeline and LLM judge for end-to-end evaluation.
  • Performance: Top agent success rate is 33.3%, indicating significant room for improvement.
  • Data: Task definitions and full 5-layer execution traces available on Hugging Face.
  • Extensibility: Supports multiple agent harnesses for diverse integrations.
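
The headline success rate is a ratio of solved tasks to attempted tasks. The helper below is a hypothetical sketch of that arithmetic; the name success_rate and the counts in the usage note are illustrative and are not ClawBench results.

```python
def success_rate(solved: int, total: int) -> float:
    """Percentage of tasks solved, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= solved <= total:
        raise ValueError("solved must be between 0 and total")
    return round(100 * solved / total, 1)
```

For example, success_rate(1, 3) evaluates to 33.3, which is the shape of figure reported for the top agent: about one task in three completed end to end.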

Maintenance & Community

Actively maintained with recent updates (May 2026). Fosters community via Discord and GitHub Discussions, encouraging contributions. Supported by NAIL Group and TIGER-Lab.

Licensing & Compatibility

Released under the permissive Apache 2.0 license, enabling broad adoption in open-source and commercial projects.

Limitations & Caveats

Because tasks run on live websites, scores can shift when a site changes. The top score of 33.3% means even the best agent fails about two thirds of tasks. Failures may stem from CAPTCHAs, bot checks, and other site defenses as well as from model limitations.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 116
  • Issues (30d): 104
  • Star history: 234 stars in the last 30 days
