skill by pinchbench

Benchmarking system for AI coding agents

Created 1 month ago
480 stars

Top 64.0% on SourcePulse

Project Summary

PinchBench is a benchmarking system designed to evaluate Large Language Models (LLMs) specifically as OpenClaw coding agents. It addresses the limitations of traditional, synthetic benchmarks by testing LLMs on real-world tasks such as scheduling, coding, email triaging, and file management, providing a practical measure of their utility and performance in agentic applications.

How It Works

PinchBench evaluates LLM models as the cognitive engine for OpenClaw agents by executing a suite of real-world tasks. Its core design prioritizes practical agent performance over isolated LLM capabilities. The system rigorously tests crucial agent functionalities: accurate tool invocation with correct parameters, the ability to chain actions for multi-step reasoning, robust handling of real-world data messiness and ambiguous instructions, and ultimately, the achievement of tangible, practical outcomes like file manipulation or communication. Each task is designed for automatic grading, often augmented by an LLM judge, ensuring both objective accuracy and nuanced assessment of agent behavior.
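The loop described above can be sketched in miniature. Everything here is an illustrative assumption, not PinchBench's actual API: the `Task` structure, the `write_file` tool, and the one-shot agent stub stand in for a real agent that would let the LLM choose and chain tool calls before the environment is checked.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of an agent-benchmark loop; names and structure
# are assumptions, not PinchBench's real interfaces.

@dataclass
class Task:
    name: str
    prompt: str
    # Objective check on the environment after the agent finishes,
    # e.g. "does the expected file exist with the right contents?"
    check: Callable[[dict], bool]

def write_file(env: dict, path: str, contents: str) -> None:
    # Toy "tool": records a file in an in-memory environment.
    env.setdefault("files", {})[path] = contents

def run_agent(task: Task, tools: dict[str, Callable], env: dict) -> dict:
    # Placeholder agent step: a real run would have the LLM pick tools,
    # fill in parameters, and chain calls for multi-step tasks.
    tools["write_file"](env, "notes.txt", f"done: {task.prompt}")
    return env

def grade(task: Task, env: dict) -> bool:
    # Automatic grading against the task's objective check.
    return task.check(env)

task = Task(
    name="file-management",
    prompt="save a note",
    check=lambda env: "notes.txt" in env.get("files", {}),
)
env = run_agent(task, {"write_file": write_file}, {})
print(grade(task, env))  # True: the expected file exists
```

The key design point the sketch mirrors is that grading inspects the *outcome* in the environment, not the model's transcript.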

Quick Start & Requirements

  • Primary install/run command: Clone the repository (git clone https://github.com/pinchbench/skill.git), cd skill, then run benchmarks using ./scripts/run.sh --model <model_name>.
  • Non-default prerequisites: Python 3.10+, uv package manager, and a running OpenClaw instance.
  • Relevant links: leaderboard at pinchbench.com; OpenClaw at github.com/openclaw/openclaw; issue tracker at github.com/pinchbench/skill/issues.
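Collected into one session, the quick-start steps above look like this (the model name is a placeholder you supply):

```shell
# Clone the benchmark repository and enter it
git clone https://github.com/pinchbench/skill.git
cd skill

# Run the suite against a model; requires Python 3.10+, uv,
# and a running OpenClaw instance (see prerequisites above)
./scripts/run.sh --model <model_name>
```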

Highlighted Details

  • Features 23 distinct real-world tasks spanning Productivity (Calendar, Summaries), Research (Stocks, Conferences), Writing (Blogs, Emails), Coding (Scripts, File Structures), Analysis (Spreadsheets, PDFs), Email Triage, Memory recall, and OpenClaw ecosystem Skills.
  • Focuses on evaluating key AI agent capabilities: effective tool usage, complex multi-step reasoning, resilience to ambiguous inputs, and successful completion of practical objectives.
  • Employs a dual grading system: automated checks for objective correctness and an LLM judge for nuanced evaluation of task execution and output quality.
  • Supports submission of results to a public leaderboard at pinchbench.com, fostering community-driven benchmarking and model comparison.
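The dual grading system can be illustrated with a small sketch. The judge here is a stub standing in for a real LLM call, and the pass threshold and AND-combination of the two signals are assumptions for illustration, not PinchBench's documented scoring rule.

```python
# Illustrative dual grading: an objective check gates correctness, and a
# judge score (stubbed here; a real system would prompt an LLM with the
# task, the output, and a rubric) rates execution quality.

def objective_check(output: str, expected_substring: str) -> bool:
    # Deterministic correctness check, e.g. required content is present.
    return expected_substring in output

def judge_score(output: str) -> float:
    # Stub judge returning a 0-1 quality score; the trailing-period
    # heuristic is a placeholder for nuanced LLM evaluation.
    return 1.0 if output.strip().endswith(".") else 0.5

def final_grade(output: str, expected: str, threshold: float = 0.7) -> bool:
    # Assumed combination rule: pass only if objectively correct AND the
    # judge rates the output above the quality threshold.
    return objective_check(output, expected) and judge_score(output) >= threshold

print(final_grade("Meeting scheduled for Friday.", "Friday"))  # True
print(final_grade("Meeting scheduled for Friday", "Friday"))   # False: judge score below threshold
```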

Maintenance & Community

Issues and contributions are managed via the GitHub repository: github.com/pinchbench/skill/issues. The project is developed by the team at kilo.ai.

Licensing & Compatibility

The project is licensed under the MIT License. This license is generally permissive and allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The provided README does not explicitly detail known limitations, alpha status, or specific caveats. Instead, its documentation focuses on criteria for contributed tasks: they must be real-world, measurable, and reproducible.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 31
  • Issues (30d): 17
  • Star history: 494 stars in the last 30 days
