skill by pinchbench

Benchmarking system for AI coding agents

Created 1 month ago
480 stars

Top 64.0% on SourcePulse

Project Summary

PinchBench is a benchmarking system designed to evaluate Large Language Models (LLMs) specifically as OpenClaw coding agents. It addresses the limitations of traditional, synthetic benchmarks by testing LLMs on real-world tasks such as scheduling, coding, email triaging, and file management, providing a practical measure of their utility and performance in agentic applications.

How It Works

PinchBench evaluates LLM models as the cognitive engine for OpenClaw agents by executing a suite of real-world tasks. Its core design prioritizes practical agent performance over isolated LLM capabilities. The system rigorously tests crucial agent functionalities: accurate tool invocation with correct parameters, the ability to chain actions for multi-step reasoning, robust handling of real-world data messiness and ambiguous instructions, and ultimately, the achievement of tangible, practical outcomes like file manipulation or communication. Each task is designed for automatic grading, often augmented by an LLM judge, ensuring both objective accuracy and nuanced assessment of agent behavior.
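The loop described above can be sketched in miniature. Everything here is an illustrative assumption, not PinchBench's actual API: the `Task` structure, the `write_file` tool, and the one-shot agent stub stand in for a real agent that would let the LLM choose and chain tool calls before the environment is checked.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of an agent-benchmark loop; names and structure
# are assumptions, not PinchBench's real interfaces.

@dataclass
class Task:
    name: str
    prompt: str
    # Objective check on the environment after the agent finishes,
    # e.g. "does the expected file exist with the right contents?"
    check: Callable[[dict], bool]

def write_file(env: dict, path: str, contents: str) -> None:
    # Toy "tool": records a file in an in-memory environment.
    env.setdefault("files", {})[path] = contents

def run_agent(task: Task, tools: dict[str, Callable], env: dict) -> dict:
    # Placeholder agent step: a real run would have the LLM pick tools,
    # fill in parameters, and chain calls for multi-step tasks.
    tools["write_file"](env, "notes.txt", f"done: {task.prompt}")
    return env

def grade(task: Task, env: dict) -> bool:
    # Automatic grading against the task's objective check.
    return task.check(env)

task = Task(
    name="file-management",
    prompt="save a note",
    check=lambda env: "notes.txt" in env.get("files", {}),
)
env = run_agent(task, {"write_file": write_file}, {})
print(grade(task, env))  # True: the expected file exists
```

The key design point the sketch mirrors is that grading inspects the *outcome* in the environment, not the model's transcript.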

Quick Start & Requirements

  • Primary install/run command: Clone the repository (git clone https://github.com/pinchbench/skill.git), cd skill, then run benchmarks using ./scripts/run.sh --model <model_name>.
  • Non-default prerequisites: Python 3.10+, uv package manager, and a running OpenClaw instance.
  • Relevant links: leaderboard at pinchbench.com; OpenClaw at github.com/openclaw/openclaw; issue tracker at github.com/pinchbench/skill/issues.
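Collected into one session, the quick-start steps above look like this (the model name is a placeholder you supply):

```shell
# Clone the benchmark repository and enter it
git clone https://github.com/pinchbench/skill.git
cd skill

# Run the suite against a model; requires Python 3.10+, uv,
# and a running OpenClaw instance (see prerequisites above)
./scripts/run.sh --model <model_name>
```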

Highlighted Details

  • Features 23 distinct real-world tasks spanning Productivity (Calendar, Summaries), Research (Stocks, Conferences), Writing (Blogs, Emails), Coding (Scripts, File Structures), Analysis (Spreadsheets, PDFs), Email Triage, Memory recall, and OpenClaw ecosystem Skills.
  • Focuses on evaluating key AI agent capabilities: effective tool usage, complex multi-step reasoning, resilience to ambiguous inputs, and successful completion of practical objectives.
  • Employs a dual grading system: automated checks for objective correctness and an LLM judge for nuanced evaluation of task execution and output quality.
  • Supports submission of results to a public leaderboard at pinchbench.com, fostering community-driven benchmarking and model comparison.
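The dual grading system can be illustrated with a small sketch. The judge here is a stub standing in for a real LLM call, and the pass threshold and AND-combination of the two signals are assumptions for illustration, not PinchBench's documented scoring rule.

```python
# Illustrative dual grading: an objective check gates correctness, and a
# judge score (stubbed here; a real system would prompt an LLM with the
# task, the output, and a rubric) rates execution quality.

def objective_check(output: str, expected_substring: str) -> bool:
    # Deterministic correctness check, e.g. required content is present.
    return expected_substring in output

def judge_score(output: str) -> float:
    # Stub judge returning a 0-1 quality score; the trailing-period
    # heuristic is a placeholder for nuanced LLM evaluation.
    return 1.0 if output.strip().endswith(".") else 0.5

def final_grade(output: str, expected: str, threshold: float = 0.7) -> bool:
    # Assumed combination rule: pass only if objectively correct AND the
    # judge rates the output above the quality threshold.
    return objective_check(output, expected) and judge_score(output) >= threshold

print(final_grade("Meeting scheduled for Friday.", "Friday"))  # True
print(final_grade("Meeting scheduled for Friday", "Friday"))   # False: judge score below threshold
```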

Maintenance & Community

Issues and contributions are managed via the GitHub repository: github.com/pinchbench/skill/issues. The project is developed by the team at kilo.ai.

Licensing & Compatibility

The project is licensed under the MIT License. This license is generally permissive and allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The provided README does not explicitly detail known limitations, alpha status, or specific caveats. Instead, its documentation focuses on criteria for contributed tasks: they must be real-world, measurable, and reproducible.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 31
  • Issues (30d): 17
  • Star history: 494 stars in the last 30 days
