TIGER-AI-Lab: AI browser agent benchmark for real-world web tasks
Top 81.8% on SourcePulse
Summary
ClawBench benchmarks AI browser agents on everyday online tasks across live websites. It addresses the need for realistic evaluation by assessing agents on 153 tasks (V1) and 130 tasks (V2) across 144 sites, providing critical insights into current capabilities and limitations.
How It Works
The benchmark employs a 5-layer recording pipeline (video, network, actions, screenshots, messages) within isolated Docker containers on live consumer websites. Evaluation combines DOM matching with an LLM judge to verify end-to-end task success against human references. This method offers a comprehensive, reproducible assessment of agents navigating dynamic web environments.
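The two-stage evaluation described above can be sketched roughly as follows. This is a minimal illustration, not ClawBench's actual code: the `Recording` class, `dom_matches`, `evaluate`, and the `judge` callable are all hypothetical names, and the rule that a run passes only when both the DOM check and the judge agree is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Recording:
    """Hypothetical container for the five recorded layers of one agent run."""
    video_path: str = ""
    network_requests: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    screenshots: list = field(default_factory=list)
    messages: list = field(default_factory=list)


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so cosmetic differences do not count.
    return " ".join(text.lower().split())


def dom_matches(agent_fields: dict, reference_fields: dict) -> bool:
    """True when every field in the human reference appears with the same value."""
    return all(
        key in agent_fields
        and _normalize(agent_fields[key]) == _normalize(value)
        for key, value in reference_fields.items()
    )


def evaluate(
    agent_fields: dict,
    reference_fields: dict,
    recording: Recording,
    judge: Callable[[Recording], bool],
) -> bool:
    """Assumed combination rule: DOM match first, then the judge must also agree."""
    if not dom_matches(agent_fields, reference_fields):
        return False  # skip the (expensive) judge call when the DOM check fails
    return bool(judge(recording))
```

In this sketch the `judge` argument stands in for an LLM call that inspects the recording; a real judge would return a verdict based on the recorded layers, while a test can pass in a simple stub such as `lambda rec: True`.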
Quick Start & Requirements
Install via pip: pip install clawbench-eval. For source development, clone the repo and run ./run.sh. Prerequisites include Python 3.11+, uv, and a container engine (Docker/Podman). Key resources include the Project Page, Leaderboard, and Paper.
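Before installing, the prerequisites listed above can be checked with a short script. This is a hedged sketch using only the Python standard library; the function name `check_prerequisites` is hypothetical, and the assumption is that `uv`, `docker`, and `podman` are discoverable as executables on `PATH`.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 11), which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Python 3.11+ is the stated minimum version.
    if sys.version_info < min_python:
        required = ".".join(str(part) for part in min_python)
        problems.append(f"Python {required}+ required, found {sys.version.split()[0]}")

    # uv must be on PATH (assumed executable name: "uv").
    if which("uv") is None:
        problems.append("uv not found on PATH")

    # Either container engine is acceptable.
    if not any(which(engine) for engine in ("docker", "podman")):
        problems.append("no container engine found (docker or podman)")

    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for issue in issues:
        print(f"missing: {issue}")
    if not issues:
        print("all prerequisites found")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.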
Maintenance & Community
Actively maintained with recent updates (May 2026). Fosters community via Discord and GitHub Discussions, encouraging contributions. Supported by NAIL Group and TIGER-Lab.
Licensing & Compatibility
Released under the permissive Apache 2.0 license, enabling broad adoption in open-source and commercial projects.
Limitations & Caveats
Performance can be affected by live website changes. The current 33.3% top score highlights that most tasks remain challenging. Failures can occur due to CAPTCHAs, bot checks, model limitations, or site defenses.