mcpmark by eval-sys

Evaluate agentic models across diverse real-world tool environments

Created 3 months ago
255 stars

Top 98.8% on SourcePulse

Summary

MCPMark is a comprehensive benchmark suite designed to stress-test and evaluate agentic models within real-world Model Context Protocol (MCP) tool environments. It targets researchers and engineers, providing a reproducible, extensible framework for assessing model capabilities across services like Notion, GitHub, Filesystem, Postgres, and Playwright. The benchmark offers automated verification, unified metrics, and aggregated reports, simplifying the evaluation of complex agent behaviors.

How It Works

The system evaluates agentic models by executing ready-to-run tasks across integrated MCP services. It employs isolated sandboxes to prevent data pollution and features auto-resume for failed tasks, ensuring reliability. Each task includes strict, automated verification for objective assessment. Core advantages include reproducible results, unified metrics (e.g., pass@k, avg@k), and aggregated reports, facilitating robust model performance analysis and comparison.
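
To make the unified metrics concrete: avg@k is simply the mean success rate over the k runs of a task, while pass@k is commonly computed with the standard unbiased estimator (the probability that at least one of k runs sampled from n recorded runs passes). The Python below is an illustrative sketch of those definitions, not MCPMark's own implementation:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: probability that at least one of k runs
        sampled from n recorded runs (c of them successful) passes."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    def avg_at_k(results: list[bool]) -> float:
        """avg@k: mean success rate across the recorded runs of a task."""
        return sum(results) / len(results)

    # Illustrative only: four runs of one task, two of which passed verification.
    runs = [True, False, True, False]
    print(avg_at_k(runs))            # 0.5
    print(pass_at_k(n=4, c=2, k=2))  # ~0.83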

Quick Start & Requirements

  • Primary Install: Local (pip install -e .) or Docker (./build-docker.sh).
  • Prerequisites: Environment variables for service credentials (e.g., OPENAI_API_KEY, GITHUB_TOKENS), Playwright browsers (playwright install), and models configured via LiteLLM (see the credential sanity-check sketch after this list).
  • Setup: the quickstart takes an estimated 5 minutes.
  • Links: Official Docs (https://mcpmark.ai/docs), Website (https://mcpmark.ai), Hugging Face Trajectory Logs (https://huggingface.co/datasets/Jakumetsu/mcpmark-trajectory-log).
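
Models are configured via LiteLLM, which reads provider credentials from the environment. As a quick sanity check that your key is visible before running tasks (an illustrative snippet, not part of MCPMark; the model name is a placeholder):

    import os
    import litellm

    # LiteLLM reads provider credentials from the environment,
    # e.g. OPENAI_API_KEY for OpenAI-hosted models.
    assert os.environ.get("OPENAI_API_KEY"), "export OPENAI_API_KEY first"

    # Placeholder model name; substitute the model you plan to evaluate.
    response = litellm.completion(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "ping"}],
    )
    print(response.choices[0].message.content)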

Highlighted Details

  • Supports evaluation across Notion, GitHub, Filesystem, Postgres, and Playwright MCP services.
  • Features one-command tasks with strict, automated verification and isolated execution environments.
  • Includes auto-resume for failed tasks and unified metrics for single/multi-run evaluations (pass@k, avg@k).
  • Flexible deployment options: local (macOS/Linux validated) and Docker.

Maintenance & Community

  • Community support is available via Discord (https://discord.gg/HrKkJAxDnA).
  • Project details and research findings are linked via its website (https://mcpmark.ai) and an arXiv preprint (https://arxiv.org/abs/2509.24002).
  • Contribution guidelines are provided in docs/contributing/make-contribution.md.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • This license is permissive, generally allowing for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Auto-resume relies on matching specific error-string patterns; new or unhandled errors may require manual intervention or an upstream contribution (illustrated in the sketch following this list).
  • Final aggregated reports only include models with complete results across all tasks and runs.
  • The arXiv paper's 2025 date suggests the project may be in active development or early research stages.
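
The first caveat describes pattern-based retry: auto-resume of this kind matches an error message against a fixed set of known transient-error patterns and retries only on a match, leaving anything unrecognized for manual handling. The pattern list and function below are hypothetical illustrations, not MCPMark's own code:

    import re

    # Hypothetical transient-error patterns; MCPMark's real list lives in its source.
    RESUMABLE_PATTERNS = [
        re.compile(r"rate limit", re.IGNORECASE),
        re.compile(r"connection reset", re.IGNORECASE),
        re.compile(r"timed? ?out", re.IGNORECASE),
    ]

    def should_auto_resume(error_message: str) -> bool:
        """Retry only when the error matches a known transient pattern;
        anything unrecognized is left for manual inspection."""
        return any(p.search(error_message) for p in RESUMABLE_PATTERNS)

    print(should_auto_resume("HTTP 429: rate limit exceeded"))  # True
    print(should_auto_resume("schema validation failed"))       # False -> manual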

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 14
  • Issues (30d): 2
  • Star History: 111 stars in the last 30 days

Explore Similar Projects

Starred by Maxime Labonne (Head of Post-Training at Liquid AI), Lewis Tunstall (Research Engineer at Hugging Face), and 5 more.

openbench by groq

2.8% · 590 stars
Provider-agnostic LLM evaluation infrastructure
Created 2 months ago · Updated 2 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 2 more.

tau-bench by sierra-research

1.8% · 881 stars
Benchmark for tool-agent-user interaction research
Created 1 year ago · Updated 1 month ago