MCP-Universe by SalesforceAIResearch

A framework for developing, testing, and benchmarking AI agents

Created 4 months ago
423 stars


Project Summary

MCP-Universe is a comprehensive framework for developing, testing, and benchmarking AI agents and LLMs. It addresses the limitations of existing benchmarks by evaluating agents in real-world scenarios through interaction with real MCP (Model Context Protocol) servers, with a focus on long-horizon reasoning, large tool spaces, and dynamic environments. The framework targets AI researchers and developers who want to rigorously assess and improve agent capabilities in complex, practical applications.

How It Works

MCP-Universe employs a modular architecture with distinct layers for agents, workflows, MCP servers, LLM integration, benchmarking, and a dashboard. Agents can be basic, ReAct-based, or function-call agents, with support for custom agent types. The workflow layer handles agent orchestration, enabling multi-agent collaboration. The framework integrates with multiple LLM providers and includes a benchmarking layer for evaluation, with a dashboard for visualization. This layered approach allows for flexibility in agent design and evaluation.
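
The summary names the layers but not their APIs, so the toy Python sketch below only illustrates how a ReAct-style agent layered over an MCP server's tool space might fit together. Every identifier in it is hypothetical, not MCP-Universe's actual interface.

    # Illustrative sketch only: MCP-Universe's real class names and APIs are
    # not shown in this summary, so every identifier below is hypothetical.

    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class MCPServer:
        """Stand-in for a real MCP server exposing a named tool space."""
        name: str
        tools: dict[str, Callable[[str], str]] = field(default_factory=dict)

        def call(self, tool: str, arg: str) -> str:
            return self.tools[tool](arg)

    @dataclass
    class ReActAgent:
        """Toy ReAct loop: think, act against a tool, observe, repeat."""
        llm: Callable[[str], str]   # any text-in/text-out model
        server: MCPServer
        max_steps: int = 5

        def run(self, task: str) -> str:
            context = f"Task: {task}"
            for _ in range(self.max_steps):
                thought = self.llm(context)
                if thought.startswith("FINAL:"):
                    return thought.removeprefix("FINAL:").strip()
                tool, _, arg = thought.partition(" ")  # crude "tool arg" action format
                observation = self.server.call(tool, arg)
                context += f"\nAction: {thought}\nObservation: {observation}"
            return context  # ran out of steps

    # Usage with stub components standing in for a real LLM and MCP server:
    maps = MCPServer("maps", {"geocode": lambda q: f"37.44,-122.14 ({q})"})
    scripted = iter(["geocode Palo Alto", "FINAL: 37.44,-122.14"])
    agent = ReActAgent(llm=lambda _ctx: next(scripted), server=maps)
    print(agent.run("Find the coordinates of Palo Alto"))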

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a virtual environment, and install dependencies using pip install -r requirements.txt and pip install -r dev-requirements.txt.
  • Prerequisites: Python 3.10+, Docker (for MCP servers), and optionally PostgreSQL and Redis. Platform-specific dependencies include libpq-dev on Linux and postgresql via Homebrew on macOS.
  • Configuration: Copy .env.example to .env and populate it with the necessary API keys (e.g., OpenAI, Anthropic, Google Maps, SerpAPI, GitHub); a quick environment check is sketched after this list.
  • Documentation: Links to the paper, website, leaderboard, and Discord community are provided in the README.
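
As a quick post-configuration sanity check, a short script can confirm the keys were actually picked up from .env. This sketch assumes python-dotenv is installed and that the framework reads credentials from environment variables; the variable names shown are illustrative, not confirmed by the summary.

    # Hypothetical environment check; the variable names are assumptions.
    import os

    from dotenv import load_dotenv  # pip install python-dotenv

    load_dotenv()  # read values from the local .env file

    for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "SERPAPI_API_KEY"):
        print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")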

Highlighted Details

  • Evaluates LLMs in real-world scenarios with actual MCP servers, addressing challenges like long-horizon reasoning and large tool spaces.
  • Supports a wide range of domains including web search, location navigation, browser automation, financial analysis, repository management, and 3D design.
  • Provides a flexible system for creating custom benchmarks with task definitions, agent/workflow definitions, and benchmark configurations (a shape sketch follows this list).
  • Includes performance highlights showing success rates for state-of-the-art models like GPT-5, Grok-4, and Claude-4.0-Sonnet on real-world MCP interactions.
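
The summary does not show the actual schema, but a custom benchmark presumably wires together the three pieces named above: tasks, an agent or workflow, and a benchmark configuration. A minimal sketch of that shape in Python, with every field name hypothetical:

    # Hypothetical benchmark definition; MCP-Universe's real schema is not
    # given in this summary, so all field names below are illustrative.

    task = {
        "name": "plan_transit_route",           # task definition
        "mcp_servers": ["google-maps"],         # tool space the agent may use
        "instruction": "Plan a transit route from A to B.",
        "evaluator": "exact_match",             # hypothetical scoring rule
    }

    benchmark = {
        "tasks": [task],
        "agent": {"type": "react", "model": "gpt-4o"},  # agent/workflow definition
        "max_steps": 20,                                # benchmark configuration
    }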

Maintenance & Community

The project is associated with Salesforce AI Research. Community interaction is facilitated through a Discord server.

Licensing & Compatibility

The provided README does not explicitly state a license. Users should check the repository for a LICENSE file and verify licensing terms before use.

Limitations & Caveats

  • Security: for GitHub integration, the README recommends using a dedicated test account, since AI agents may modify repositories during benchmark runs.
  • Blender operations in the 3D-design benchmarks may modify or create files, so the README suggests running them in an isolated environment or keeping backups.
  • The success rates reported for state-of-the-art models indicate significant room for improvement in current LLM agents for real-world MCP interactions.

Health Check

  • Last commit: 19 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 6
  • Star history: 424 stars in the last 30 days
