MMSkills by DeepExperience

Multimodal procedural knowledge for general visual agents

Created 1 month ago
318 stars

Top 85.0% on SourcePulse

Project Summary

MMSkills is a framework for creating, managing, and deploying reusable multimodal procedural knowledge for general visual agents. It targets researchers and developers building agents that interact with graphical user interfaces. By drawing on a library of predefined, context-aware skills, agents can complete complex GUI tasks more reliably and efficiently.

How It Works

MMSkills represents procedural knowledge as self-contained "skill packages," each including textual guidance, compact state-card metadata, and optional visual references. At inference time, the agent maintains lightweight skill hints. When a skill is deemed potentially useful, a temporary "skill branch" is activated. This branch consults relevant skills, determines the necessity of visual references, loads only the required state views, and then provides structured guidance back to the agent. This approach allows for efficient multimodal reasoning and keeps the agent's main context lean.
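The flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the names SkillPackage, StateCard, skill_hints, and run_skill_branch are hypothetical, and the substring match used to pick relevant state views stands in for whatever relevance check the real system performs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StateCard:
    """Compact metadata for one GUI state (hypothetical structure)."""
    name: str
    description: str
    image_path: Optional[str] = None  # optional visual reference


@dataclass
class SkillPackage:
    """Self-contained skill: a short hint, textual guidance, state cards."""
    name: str
    hint: str        # lightweight hint kept in the agent's main context
    guidance: str    # full textual procedure, read only inside a branch
    state_cards: List[StateCard] = field(default_factory=list)


@dataclass
class BranchResult:
    """Structured guidance handed back to the main agent."""
    skill: str
    guidance: str
    loaded_views: List[str]


def skill_hints(skills: List[SkillPackage]) -> List[str]:
    # Only these one-line hints live in the main context.
    return [f"{s.name}: {s.hint}" for s in skills]


def run_skill_branch(skill: SkillPackage, observation: str,
                     needs_visual: bool) -> BranchResult:
    """Temporary branch: consult one skill and load only the views needed."""
    views: List[str] = []
    if needs_visual:
        for card in skill.state_cards:
            # Placeholder relevance check (an assumption of this sketch):
            # load a view only if its state name appears in the observation.
            if card.image_path and card.name.lower() in observation.lower():
                views.append(card.image_path)
    return BranchResult(skill.name, skill.guidance, views)


if __name__ == "__main__":
    export_pdf = SkillPackage(
        name="export-pdf",
        hint="Export the open document as PDF",
        guidance="Open File, choose Export, pick PDF, confirm.",
        state_cards=[
            StateCard("export dialog", "Export options", "views/export.png"),
            StateCard("file menu", "File menu open", "views/file_menu.png"),
        ],
    )
    print(skill_hints([export_pdf]))
    print(run_skill_branch(export_pdf, "The Export dialog is open", True))
```

Because the branch returns only the guidance text and the paths of the views it actually loaded, the main context never holds the full skill package.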

Quick Start & Requirements

To install, clone the repository, install the Python dependencies (pip install -r requirements.txt), and integrate with an OSWorld checkout by running python3 scripts/install_into_osworld.py /path/to/OSWorld --with-runner --with-skills. An agent adapter for Codex is available via a one-line script. Prerequisites are Python 3.10+, an OSWorld installation, and access to an OpenAI-compatible or Gemini-compatible API endpoint for language models. Key resources are the project website, the Skill Library, and the demo videos.

Highlighted Details

  • Self-contained skill packages with procedural descriptions, state metadata, and visual keyframes.
  • Multimodal evidence gating and branch-loaded planning for dynamic visual reference utilization.
  • OSWorld integration, including runner patches and task-to-skill mappings.
  • An agent adapter supporting Codex, OpenClaw, and Claude Code, enabling on-demand retrieval from a 515-skill Hugging Face library.
  • A community-extensible skill library with a review-first publishing process.

Maintenance & Community

The project encourages community contributions of new skills in domains such as autonomous driving, robotics, and mobile agents. Submissions go through the project website or GitHub issues and are curated and normalized before being added to the public library.

Licensing & Compatibility

MMSkills is released under the Apache License 2.0, which is permissive for commercial use and integration into closed-source projects. Some OSWorld integration components are derived from OSWorld itself.

Limitations & Caveats

Full functionality depends on integration with the OSWorld framework. The current public release focuses on a "compact multimodal desktop-skill subset" and is not a complete fork of OSWorld. The model-agnostic interfaces require API endpoints and keys to be configured explicitly.
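As a rough illustration of what explicit endpoint configuration can look like, the sketch below validates settings up front and fails fast when one is missing. Everything in it is an assumption made for illustration: the variable names LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL and the EndpointConfig type are hypothetical and are not taken from MMSkills or OSWorld, whose actual configuration keys should be checked in the repository.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class EndpointConfig:
    """Settings for an OpenAI-compatible or Gemini-compatible endpoint."""
    base_url: str
    api_key: str
    model: str


# Hypothetical setting names, chosen only for this sketch.
REQUIRED_KEYS = ("LLM_BASE_URL", "LLM_API_KEY", "LLM_MODEL")


def load_endpoint_config(env: Mapping[str, str]) -> EndpointConfig:
    """Build a config from a mapping such as os.environ, failing fast."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError("missing endpoint settings: " + ", ".join(missing))
    return EndpointConfig(
        base_url=env["LLM_BASE_URL"].rstrip("/"),  # normalize trailing slash
        api_key=env["LLM_API_KEY"],
        model=env["LLM_MODEL"],
    )
```

Passing os.environ as the mapping reads the settings from the process environment, while a plain dict works for tests.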

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 5

Star History

315 stars in the last 30 days
