MCP-Universe by SalesforceAIResearch

A framework for developing, testing, and benchmarking AI agents

Created 4 months ago
423 stars


Project Summary

MCP-Universe is a comprehensive framework for developing, testing, and benchmarking AI agents and LLMs. It addresses the limitations of existing benchmarks by evaluating agents in real-world scenarios through interaction with real MCP (Model Context Protocol) servers, with a focus on long-horizon reasoning, large tool spaces, and dynamic environments. The framework targets AI researchers and developers who want to rigorously assess and improve agent capabilities in complex, practical applications.

How It Works

MCP-Universe employs a modular architecture with distinct layers for agents, workflows, MCP servers, LLM integration, benchmarking, and a dashboard. Agents can be basic, ReAct-based, or function-call agents, with support for custom agent types. The workflow layer handles agent orchestration, enabling multi-agent collaboration. The framework integrates with multiple LLM providers and includes a benchmarking layer for evaluation, with a dashboard for visualization. This layered approach allows for flexibility in agent design and evaluation.
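
The summary names the layers but not their APIs, so the toy Python sketch below only illustrates how a ReAct-style agent layered over an MCP server's tool space might fit together. Every identifier in it is hypothetical, not MCP-Universe's actual interface.

    # Illustrative sketch only: MCP-Universe's real class names and APIs are
    # not shown in this summary, so every identifier below is hypothetical.

    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class MCPServer:
        """Stand-in for a real MCP server exposing a named tool space."""
        name: str
        tools: dict[str, Callable[[str], str]] = field(default_factory=dict)

        def call(self, tool: str, arg: str) -> str:
            return self.tools[tool](arg)

    @dataclass
    class ReActAgent:
        """Toy ReAct loop: think, act against a tool, observe, repeat."""
        llm: Callable[[str], str]   # any text-in/text-out model
        server: MCPServer
        max_steps: int = 5

        def run(self, task: str) -> str:
            context = f"Task: {task}"
            for _ in range(self.max_steps):
                thought = self.llm(context)
                if thought.startswith("FINAL:"):
                    return thought.removeprefix("FINAL:").strip()
                tool, _, arg = thought.partition(" ")  # crude "tool arg" action format
                observation = self.server.call(tool, arg)
                context += f"\nAction: {thought}\nObservation: {observation}"
            return context  # ran out of steps

    # Usage with stub components standing in for a real LLM and MCP server:
    maps = MCPServer("maps", {"geocode": lambda q: f"37.44,-122.14 ({q})"})
    scripted = iter(["geocode Palo Alto", "FINAL: 37.44,-122.14"])
    agent = ReActAgent(llm=lambda _ctx: next(scripted), server=maps)
    print(agent.run("Find the coordinates of Palo Alto"))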

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a virtual environment, and install dependencies using pip install -r requirements.txt and pip install -r dev-requirements.txt.
  • Prerequisites: Python 3.10+, Docker (for MCP servers), and optionally PostgreSQL and Redis. Platform-specific dependencies include libpq-dev on Linux and postgresql via Homebrew on macOS.
  • Configuration: Copy .env.example to .env and populate it with the necessary API keys (e.g., OpenAI, Anthropic, Google Maps, SerpAPI, GitHub); a quick environment check is sketched after this list.
  • Documentation: Links to the paper, website, leaderboard, and Discord community are provided in the README.
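
As a quick post-configuration sanity check, a short script can confirm the keys were actually picked up from .env. This sketch assumes python-dotenv is installed and that the framework reads credentials from environment variables; the variable names shown are illustrative, not confirmed by the summary.

    # Hypothetical environment check; the variable names are assumptions.
    import os

    from dotenv import load_dotenv  # pip install python-dotenv

    load_dotenv()  # read values from the local .env file

    for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "SERPAPI_API_KEY"):
        print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")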

Highlighted Details

  • Evaluates LLMs in real-world scenarios with actual MCP servers, addressing challenges like long-horizon reasoning and large tool spaces.
  • Supports a wide range of domains including web search, location navigation, browser automation, financial analysis, repository management, and 3D design.
  • Provides a flexible system for creating custom benchmarks with task definitions, agent/workflow definitions, and benchmark configurations (a shape sketch follows this list).
  • Includes performance highlights showing success rates for state-of-the-art models like GPT-5, Grok-4, and Claude-4.0-Sonnet on real-world MCP interactions.
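
The summary does not show the actual schema, but a custom benchmark presumably wires together the three pieces named above: tasks, an agent or workflow, and a benchmark configuration. A minimal sketch of that shape in Python, with every field name hypothetical:

    # Hypothetical benchmark definition; MCP-Universe's real schema is not
    # given in this summary, so all field names below are illustrative.

    task = {
        "name": "plan_transit_route",           # task definition
        "mcp_servers": ["google-maps"],         # tool space the agent may use
        "instruction": "Plan a transit route from A to B.",
        "evaluator": "exact_match",             # hypothetical scoring rule
    }

    benchmark = {
        "tasks": [task],
        "agent": {"type": "react", "model": "gpt-4o"},  # agent/workflow definition
        "max_steps": 20,                                # benchmark configuration
    }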

Maintenance & Community

The project is associated with Salesforce AI Research. Community interaction is facilitated through a Discord server.

Licensing & Compatibility

The provided README does not explicitly state a license. Users should check the repository for a LICENSE file and verify licensing terms before use.

Limitations & Caveats

  • Security: for GitHub integration, the README recommends using a dedicated test account, since AI agents may modify repositories during benchmark runs.
  • Blender operations in the 3D-design benchmarks may modify or create files, so the README suggests running them in an isolated environment or keeping backups.
  • The success rates reported for state-of-the-art models indicate significant room for improvement in current LLM agents for real-world MCP interactions.

Health Check

  • Last commit: 19 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 6
  • Star history: 424 stars in the last 30 days
