mcp-bench by Accenture

Evaluate LLM agents' tool-use capabilities

Created 3 weeks ago


314 stars

Top 85.9% on SourcePulse

View on GitHub
Project Summary

MCP-Bench is an evaluation framework for benchmarking Large Language Model (LLM) agents' tool-use capabilities in complex, real-world tasks via the Model Context Protocol (MCP). It offers an end-to-end pipeline for assessing how effectively LLMs discover, select, and utilize tools, providing valuable insights for researchers and developers in the LLM agent space.

How It Works

The framework employs the Model Context Protocol (MCP) to facilitate LLM interaction with a suite of 28 diverse real-world services, referred to as MCP servers. It orchestrates benchmark runs, evaluating LLMs on their ability to understand task requirements, select appropriate tools from available MCP servers, and execute them effectively. This approach allows for systematic measurement of tool-use proficiency across various complex scenarios.

Quick Start & Requirements

Installation involves cloning the repository (https://github.com/accenture/mcp-bench.git), setting up a Conda environment with Python 3.10, and running an installation script for the MCP servers. You will also need API keys for several services (OpenRouter, Azure OpenAI, NPS, NASA, Hugging Face, Google Maps, NCI); some API key registration portals may require a US IP address. Available model providers can be browsed at https://openrouter.ai/models.
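The steps above can be sketched as a short setup script. The Conda environment name and the install-script filename are illustrative assumptions, not confirmed by the repository; only the clone URL and Python version come from the text above:

```shell
# Clone the repository (URL from the summary above)
git clone https://github.com/accenture/mcp-bench.git
cd mcp-bench

# Create and activate a Python 3.10 Conda environment
# (the environment name "mcp-bench" is illustrative)
conda create -n mcp-bench python=3.10 -y
conda activate mcp-bench

# Run the MCP-server installation script; the filename below is a
# placeholder -- check the repository for the actual script name
bash ./install.sh
```

Check the repository README for the exact script name and any additional setup flags before running.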

Highlighted Details

  • Benchmarks LLM agents' tool-use capabilities across 28 diverse MCP servers, including Google Maps, Hugging Face, and NASA Data.
  • Features a leaderboard showcasing performance scores for leading LLMs like GPT-5, Gemini-2.5-Pro, and Claude-Sonnet-4.
  • Supports extensibility by allowing the addition of new model providers, such as integrating Azure models via OpenRouter.
  • Evaluates LLMs across multiple dimensions including schema understanding, task completion, tool usage, and planning effectiveness.

Maintenance & Community

The project's primary documentation is its README. Details on active maintenance, community channels (such as Discord or Slack), or sponsorships are not given in the provided text. The citation lists several authors, suggesting an accompanying research publication.

Licensing & Compatibility

The specific open-source license for the MCP-Bench repository is not explicitly stated in the provided README content.

Limitations & Caveats

Acquiring the necessary API keys can be a hurdle: some registration portals may require a US IP address, as noted in Issue #10. The setup also requires configuring multiple external API keys, which adds complexity to the initial deployment.
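API keys like these are typically supplied via environment variables. The variable names below are illustrative placeholders only; the repository's actual configuration mechanism and variable names are not stated in the text, so consult the README:

```shell
# Illustrative environment variables for the required API keys.
# Actual names and the configuration method (env vars vs. a config
# file) may differ -- these are placeholders, not the repo's API.
export OPENROUTER_API_KEY="..."
export AZURE_OPENAI_API_KEY="..."
export NASA_API_KEY="..."
export HF_TOKEN="..."
export GOOGLE_MAPS_API_KEY="..."
```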

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 7
  • Issues (30d): 8

Star History

  • 314 stars in the last 22 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Magnus Müller (Cofounder of Browser Use), and 83 more.

langchain by langchain-ai: Framework for building LLM-powered applications. 116k stars. Created 2 years ago; updated 1 day ago.