WildClawBench by InternLM

Evaluating AI agents with in-the-wild, end-to-end tasks

Created 2 weeks ago · 266 stars · Top 96.1% on SourcePulse

View on GitHub
Project Summary

Summary

WildClawBench is a rigorous, "in-the-wild" benchmark for evaluating AI agents' end-to-end capabilities in real-world scenarios. Targeting AI researchers and engineers, it assesses agents' ability to perform practical, complex tasks autonomously within the OpenClaw personal assistant environment, yielding a more meaningful performance metric than isolated skill tests.

How It Works

Agents are deployed into a live OpenClaw instance, a functional personal AI assistant, to execute 60 original, complex tasks. The benchmark stresses multi-step tool orchestration, error recovery, autonomous planning, multimodal understanding (video, image), long-horizon workflows, coding, and safety alignment. This approach uniquely evaluates agents on their capacity for "real work" end-to-end, moving beyond discrete skill assessments.
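This page doesn't document the harness API, so the following is only a minimal sketch of what an end-to-end evaluation loop of this shape might look like; Task, evaluate, and the grading callback are hypothetical names for illustration, not WildClawBench's actual interface.

```python
# Minimal sketch of an end-to-end agent evaluation loop of the kind described
# above. All names (Task, evaluate, grade) are hypothetical illustrations,
# not WildClawBench's actual API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    prompt: str                    # natural-language instruction for the agent
    grade: Callable[[str], bool]   # end-to-end check on the final outcome

def evaluate(agent: Callable[[str], str], tasks: list[Task]) -> float:
    """Score an agent by running each task to completion and checking only
    the final result, not the intermediate tool calls."""
    passed = 0
    for task in tasks:
        outcome = agent(task.prompt)   # agent plans, calls tools, recovers from errors
        passed += task.grade(outcome)  # pass/fail on the end state
    return passed / len(tasks)         # e.g., a pass rate over a 60-task suite
```

The design point the summary emphasizes is that grading happens on the final outcome of a long-horizon workflow rather than on isolated skills.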

Quick Start & Requirements

Setup primarily uses Docker, with guides for macOS and Ubuntu. Users download the wildclawbench-ubuntu_v1.2.tar Docker image and workspace data from HuggingFace (Dataset Link). The script/prepare.sh script handles asset downloads (videos, weights) and requires yt-dlp, ffmpeg, and gdown. API keys for OpenRouter and Brave Search (brave.com/search/api) are configured via .env. Custom model endpoints are supported via JSON config, but manual edits may be needed for hardcoded OpenRouter references.
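As a rough illustration of the configuration step (the JSON field names, file name, and environment-variable names below are assumptions, not taken from the repo), a custom endpoint config might be written like this:

```python
# Sketch of a custom model-endpoint JSON config of the kind the summary says
# is supported. Every field and file name here is an assumption.
import json

endpoint = {
    "model": "my-local-model",               # hypothetical field
    "base_url": "http://localhost:8000/v1",  # hypothetical field
    "api_key_env": "CUSTOM_API_KEY",         # hypothetical field
}

with open("model_config.json", "w") as f:
    json.dump(endpoint, f, indent=2)

# The .env file carries the OpenRouter / Brave Search keys; the exact
# variable names are assumptions:
#   OPENROUTER_API_KEY=...
#   BRAVE_SEARCH_API_KEY=...
```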

Health Check

Last Commit: 1 week ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 4

Star History

267 stars in the last 20 days

Explore Similar Projects

Starred by Dan Abramov (Core Contributor to React; Coauthor of Redux, Create React App), Elie Bursztein (Cybersecurity Lead at Google DeepMind), and 10 more.

terminal-bench by harbor-framework

4.0% · 2k stars
Benchmark for LLM agents in real terminal environments
Created 1 year ago · Updated 2 months ago