WildClawBench by InternLM

Evaluating AI agents with in-the-wild, end-to-end tasks

Created 2 weeks ago · 266 stars · Top 96.1% on SourcePulse

View on GitHub
Project Summary

Summary

WildClawBench is a rigorous, "in-the-wild" benchmark for evaluating AI agents' end-to-end capabilities in real-world scenarios. Targeting AI researchers and engineers, it assesses agents' ability to perform practical, complex tasks autonomously within the OpenClaw personal assistant environment, yielding a more meaningful performance metric than isolated skill tests.

How It Works

Agents are deployed into a live OpenClaw instance, a functional personal AI assistant, to execute 60 original, complex tasks. The benchmark stresses multi-step tool orchestration, error recovery, autonomous planning, multimodal understanding (video, image), long-horizon workflows, coding, and safety alignment. This approach uniquely evaluates agents on their capacity for "real work" end-to-end, moving beyond discrete skill assessments.
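This page doesn't document the harness API, so the following is only a minimal sketch of what an end-to-end evaluation loop of this shape might look like; Task, evaluate, and the grading callback are hypothetical names for illustration, not WildClawBench's actual interface.

```python
# Minimal sketch of an end-to-end agent evaluation loop of the kind described
# above. All names (Task, evaluate, grade) are hypothetical illustrations,
# not WildClawBench's actual API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    prompt: str                    # natural-language instruction for the agent
    grade: Callable[[str], bool]   # end-to-end check on the final outcome

def evaluate(agent: Callable[[str], str], tasks: list[Task]) -> float:
    """Score an agent by running each task to completion and checking only
    the final result, not the intermediate tool calls."""
    passed = 0
    for task in tasks:
        outcome = agent(task.prompt)   # agent plans, calls tools, recovers from errors
        passed += task.grade(outcome)  # pass/fail on the end state
    return passed / len(tasks)         # e.g., a pass rate over a 60-task suite
```

The design point the summary emphasizes is that grading happens on the final outcome of a long-horizon workflow rather than on isolated skills.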

Quick Start & Requirements

Setup primarily uses Docker, with guides for macOS and Ubuntu. Users download the wildclawbench-ubuntu_v1.2.tar Docker image and workspace data from HuggingFace (Dataset Link). The script/prepare.sh script handles asset downloads (videos, weights) and requires yt-dlp, ffmpeg, and gdown. API keys for OpenRouter and Brave Search (brave.com/search/api) are configured via .env. Custom model endpoints are supported via JSON config, but manual edits may be needed for hardcoded OpenRouter references.
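As a rough illustration of the configuration step (the JSON field names, file name, and environment-variable names below are assumptions, not taken from the repo), a custom endpoint config might be written like this:

```python
# Sketch of a custom model-endpoint JSON config of the kind the summary says
# is supported. Every field and file name here is an assumption.
import json

endpoint = {
    "model": "my-local-model",               # hypothetical field
    "base_url": "http://localhost:8000/v1",  # hypothetical field
    "api_key_env": "CUSTOM_API_KEY",         # hypothetical field
}

with open("model_config.json", "w") as f:
    json.dump(endpoint, f, indent=2)

# The .env file carries the OpenRouter / Brave Search keys; the exact
# variable names are assumptions:
#   OPENROUTER_API_KEY=...
#   BRAVE_SEARCH_API_KEY=...
```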

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 4

Star History

267 stars in the last 20 days

Explore Similar Projects

Starred by Dan Abramov (Core Contributor to React; Coauthor of Redux, Create React App), Elie Bursztein (Cybersecurity Lead at Google DeepMind), and 10 more.

terminal-bench by harbor-framework

4.0% · 2k stars
Benchmark for LLM agents in real terminal environments
Created 1 year ago · Updated 2 months ago