claw-eval by claw-eval

Evaluate LLM agents in real-world scenarios with a transparent benchmark

Created 1 month ago
391 stars

Top 73.4% on SourcePulse

Project Summary

Claw-Eval provides an end-to-end, transparent benchmark for evaluating Large Language Model (LLM) agents acting in real-world scenarios. It addresses the need for reproducible and robust evaluation by offering 139 human-verified tasks, 15 integrated services, and Docker-based sandboxing. This allows researchers and developers to rigorously assess agent performance with a focus on reliability and real-world complexity.

How It Works

The core methodology employs Docker sandboxes for isolated execution and a strict "Pass^3" metric: a model earns credit for a task only if it succeeds in all three independent trials, which eliminates "lucky runs" and rewards consistent performance. The system is designed for end-to-end reproducibility; when API instability interrupts a run, the evaluation is re-triggered until three complete trajectories have been generated.
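The Pass^3 rule described above can be sketched in a few lines of Python. The function names here are illustrative, not the project's actual API:

```python
def pass_cubed(trial_results):
    """Pass^3: a task counts as passed only if all three
    independent trials succeed (no credit for lucky runs)."""
    if len(trial_results) != 3:
        raise ValueError("Pass^3 requires exactly three trials")
    return all(trial_results)


def pass_cubed_rate(per_task_trials):
    """Fraction of tasks whose three trials all succeeded."""
    passed = sum(pass_cubed(trials) for trials in per_task_trials)
    return passed / len(per_task_trials)
```

Because a single failed trial zeroes out the task, Pass^3 is a strictly harsher score than the common pass@1 average over trials.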

Quick Start & Requirements

  • Installation: Use uv for dependency management (pip install uv, uv venv --python 3.11, source .venv/bin/activate).
  • Prerequisites: API keys for OPENROUTER_API_KEY and SERP_DEV_KEY (for web search tasks) are required. Python 3.11 is recommended.
  • Execution: Run sandbox tests via scripts/test_sandbox.sh or batch evaluations using claw-eval batch --config <config_file> --sandbox --trials 3 --parallel <N>.
  • Documentation: Leaderboard and task details are available at https://claw-eval.github.io.
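Putting the steps above together, a typical session might look like the following sketch. The API key values are placeholders, and the config filename and parallelism value are illustrative assumptions, not defaults shipped with the project:

```shell
# Set up a Python 3.11 environment with uv
pip install uv
uv venv --python 3.11
source .venv/bin/activate

# Required API keys (placeholders -- substitute your own)
export OPENROUTER_API_KEY="sk-or-..."
export SERP_DEV_KEY="..."   # needed only for web-search tasks

# Smoke-test the Docker sandbox, then run a batch evaluation
scripts/test_sandbox.sh
claw-eval batch --config my_config.yaml --sandbox --trials 3 --parallel 4
```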

Highlighted Details

  • Supports 139 tasks and 23 models, with a live leaderboard.
  • Introduced Pass^3 metric for enhanced reliability across three independent trials.
  • Version 1.1.0 adds 35 multimodal agentic tasks, expanding capabilities to perception, reasoning, creation, and delivery.
  • Commitment to end-to-end reproducibility, with ongoing codebase audits.

Maintenance & Community

  • Key contributors include Bowen Ye, Rang Li, Qibin Yang, Zhihui Xie, and Lei Li (Project Lead).
  • No explicit community channels (Discord/Slack) or social media links are provided in the README. A roadmap is outlined.

Licensing & Compatibility

  • Licensed under the MIT license, permitting broad use, including commercial applications.

Limitations & Caveats

  • The codebase is currently undergoing an audit to ensure full reproducibility of benchmark results.
  • Future roadmap items indicate potential for enhanced scoring logic, state verification, and sandbox isolation, suggesting these areas may be less mature in the current version.
  • The project appears relatively new, with core releases documented from March 2026.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 8
  • Issues (30d): 21
  • Star History: 345 stars in the last 30 days
