human-mcp by mrgoonie

Human-like multimodal capabilities for AI agents

Created 6 months ago
271 stars

Top 95.1% on SourcePulse

Project Summary

This project provides Human MCP, a comprehensive Model Context Protocol (MCP) server designed to equip AI coding agents with human-like multimodal capabilities. It fills a gap in current AI agents by integrating visual analysis, document processing, content creation, speech generation, browser automation, and advanced reasoning, enabling more sophisticated debugging, understanding, and enhancement of multimodal content. The target audience is AI developers and users who want to give their agents a richer set of human-like functionalities.

How It Works

Human MCP acts as middleware, exposing 29 distinct tools grouped into four human capabilities: Eyes (visual/document analysis), Hands (content generation/editing/automation), Mouth (speech generation), and Brain (advanced reasoning). It leverages a diverse technology stack, including Google Gemini (for vision, document, speech, image, and video processing), the Imagen and Veo APIs, ElevenLabs, Minimax, ZhipuAI, Playwright for browser automation, and Jimp for local image manipulation. A key advantage is its multi-provider support, which lets users select a preferred AI model for each capability, offering flexibility and cost optimization.

Quick Start & Requirements

  • Primary Install/Run: Typically via npx @goonnguyen/human-mcp or bun run dev for development.
  • Prerequisites: Node.js v22+ or Bun v1.2+. A Google Gemini API key is essential for core functionalities. Additional API keys for Minimax, ZhipuAI, and ElevenLabs may be required for specific tools.
  • Setup: Configuration involves setting API keys as environment variables (e.g., GOOGLE_GEMINI_API_KEY) or within client-specific configuration files. Detailed setup guides are available for various MCP clients like Claude Desktop, Claude Code CLI, and Cursor.
  • Links: Google AI Studio for API key generation.
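As a concrete illustration, a stdio-based MCP client configuration (such as Claude Desktop's `mcpServers` entry) might look like the sketch below. The exact file location, server alias, and any additional provider keys depend on your client and the tools you enable; the key name `human-mcp` is an arbitrary label.

```json
{
  "mcpServers": {
    "human-mcp": {
      "command": "npx",
      "args": ["@goonnguyen/human-mcp"],
      "env": {
        "GOOGLE_GEMINI_API_KEY": "your-gemini-api-key"
      }
    }
  }
}
```

Tools that rely on other providers (Minimax, ZhipuAI, ElevenLabs) would need their respective API keys added to the `env` block as well.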

Highlighted Details

  • 29 Production-Ready MCP Tools: Covering visual analysis, document processing, image/video/music generation, speech synthesis, browser automation, and advanced reasoning.
  • Multi-Provider Support: Integrates with Google Gemini, Minimax, ZhipuAI, and ElevenLabs, enabling flexible AI model selection per capability.
  • Comprehensive Modalities: Offers tools for image/video analysis, document extraction/summarization, AI-powered image editing, video and music generation, text-to-speech, and browser automation for web screenshots.
  • Advanced Reasoning: Includes native sequential thinking and AI-powered reflection for complex problem-solving.

Maintenance & Community

The project outlines a "Development Roadmap & Vision" and encourages community involvement through "Getting Involved" sections, including issue reporting and discussions. While specific contributors or sponsorships are not detailed, the roadmap indicates ongoing development towards completing the human sensory suite with planned audio processing capabilities.

Licensing & Compatibility

The project is released under the MIT License, which generally permits commercial use and modification. Users should consult the terms of service for any third-party AI provider APIs used.

Limitations & Caveats

The project is actively under development, with audio processing ("Ears") planned for Q1 2025, indicating that not all core human sensory capabilities are yet implemented. Setup requires obtaining and configuring multiple API keys, which may incur costs from AI service providers. While stdio transport is available, HTTP transport with Cloudflare R2 integration is detailed for certain clients, adding a dependency for cloud-based file handling.

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
6
Issues (30d)
0
Star History
30 stars in the last 30 days
