WildClawBench (InternLM): Evaluating AI agents with in-the-wild, end-to-end tasks
Top 96.1% on SourcePulse
Summary
WildClawBench offers a rigorous, "in-the-wild" benchmark for evaluating AI agents' end-to-end capabilities in real-world scenarios. Targeting AI researchers and engineers, it assesses agents' ability to perform practical, complex tasks autonomously within the OpenClaw personal assistant environment, providing a more meaningful performance metric than isolated skill tests.
How It Works
Agents are deployed into a live OpenClaw instance, a functional personal AI assistant, to execute 60 original, complex tasks. The benchmark stresses multi-step tool orchestration, error recovery, autonomous planning, multimodal understanding (video, image), long-horizon workflows, coding, and safety alignment. This approach uniquely evaluates agents on their capacity for "real work" end-to-end, moving beyond discrete skill assessments.
Quick Start & Requirements
Setup primarily uses Docker, with guides for macOS and Ubuntu. Users download the wildclawbench-ubuntu_v1.2.tar Docker image and workspace data from HuggingFace. The script/prepare.sh helper downloads assets (videos, model weights) and requires yt-dlp, ffmpeg, and gdown. API keys for OpenRouter and Brave Search (brave.com/search/api) are configured via .env. Custom model endpoints are supported via a JSON config, but manual edits may be needed for hardcoded Open…
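A minimal sketch of that setup flow, assuming a standard Docker workflow: the image filename and script path come from the text above, but the exact .env variable names and the JSON endpoint fields are assumptions and may differ from the actual repository.

```bash
# Load the benchmark image downloaded from HuggingFace (filename per the text above).
docker load -i wildclawbench-ubuntu_v1.2.tar

# Fetch videos, weights, and other assets; requires yt-dlp, ffmpeg, and gdown on PATH.
bash script/prepare.sh

# API keys go in .env; these variable names are assumptions, not confirmed by the repo.
cat > .env <<'EOF'
OPENROUTER_API_KEY=sk-or-...
BRAVE_SEARCH_API_KEY=...
EOF

# Hypothetical custom-endpoint JSON config; the field names here are illustrative only.
cat > model_endpoint.json <<'EOF'
{
  "model": "my-local-model",
  "base_url": "http://localhost:8000/v1",
  "api_key": "EMPTY"
}
EOF
```

Note that, per the text above, a JSON config alone may not be enough for custom endpoints: if an endpoint is hardcoded in the image or scripts, manual edits may still be required.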
1 week ago
Inactive
TheAgentCompany
xlang-ai
harbor-framework