DeepExperience: Multimodal procedural knowledge for general visual agents
Top 85.0% on SourcePulse
MMSkills provides a framework for creating, managing, and deploying reusable multimodal procedural knowledge for general visual agents. It targets researchers and developers building agents that interact with graphical user interfaces, offering a way to enhance their task-completion capabilities through structured, multimodal skills. The primary benefit is enabling agents to perform complex GUI tasks more reliably and efficiently by leveraging a library of pre-defined, context-aware skills.
How It Works
MMSkills represents procedural knowledge as self-contained "skill packages," each including textual guidance, compact state-card metadata, and optional visual references. At inference time, the agent maintains lightweight skill hints. When a skill is deemed potentially useful, a temporary "skill branch" is activated. This branch consults relevant skills, determines the necessity of visual references, loads only the required state views, and then provides structured guidance back to the agent. This approach allows for efficient multimodal reasoning and keeps the agent's main context lean.
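The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MMSkills API: the names StateCard, SkillPackage, run_skill_branch, and the cue field are hypothetical, and the substring match stands in for whatever relevance check the real system uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StateCard:
    """Compact metadata for one GUI state (hypothetical structure)."""
    state_id: str
    cue: str                          # text expected on screen in this state
    image_path: Optional[str] = None  # optional visual reference


@dataclass
class SkillPackage:
    """Self-contained skill: textual guidance, state cards, optional visuals."""
    name: str
    summary: str
    guidance: str
    state_cards: List[StateCard] = field(default_factory=list)

    def hint(self) -> str:
        # Only this short line is kept in the agent's main context.
        return f"{self.name}: {self.summary}"


def run_skill_branch(skill: SkillPackage, observation: str) -> dict:
    """Temporary branch: consult one skill and return structured guidance."""
    # Toy relevance rule (an assumption): a state card is relevant when
    # its cue text appears in the current observation.
    obs = observation.lower()
    relevant = [c for c in skill.state_cards if c.cue.lower() in obs]
    # Load visual references only for relevant cards that have one.
    views = [c.image_path for c in relevant if c.image_path]
    return {
        "skill": skill.name,
        "guidance": skill.guidance,
        "states": [c.state_id for c in relevant],
        "views": views,
    }


# Hypothetical skill used purely for illustration.
skill = SkillPackage(
    name="export_pdf",
    summary="Export the open document as PDF",
    guidance="Open File > Export as PDF, then confirm the dialog.",
    state_cards=[
        StateCard("export_dialog", "Export as PDF", "views/export_dialog.png"),
        StateCard("save_dialog", "Save As"),
    ],
)
result = run_skill_branch(skill, "Window shows the 'Export as PDF' dialog")
```

In this sketch the main context holds only skill.hint(), while the guidance and the single matching state view are returned by the branch and can be discarded once the step is done.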
Quick Start & Requirements
Installation involves cloning the repository, installing Python dependencies (pip install -r requirements.txt), and then integrating with an OSWorld checkout using python3 scripts/install_into_osworld.py /path/to/OSWorld --with-runner --with-skills. An agent adapter for Codex is available via a one-line script. Prerequisites include Python 3.10+, an OSWorld installation, and access to an OpenAI-compatible or Gemini-compatible API endpoint for language models. Key resources include the project website, the Skill Library, and demo videos.
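Before running the install script, it can help to confirm the two local prerequisites named above. The following is a minimal sketch, not part of MMSkills: check_prerequisites is a hypothetical helper, and it only checks the interpreter version and that the OSWorld path is an existing directory.

```python
import sys
from pathlib import Path
from typing import List, Tuple


def check_prerequisites(osworld_dir: str,
                        min_version: Tuple[int, int] = (3, 10)) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    # The summary above lists Python 3.10+ as a prerequisite.
    if sys.version_info[:2] < min_version:
        found = ".".join(str(n) for n in sys.version_info[:2])
        wanted = ".".join(str(n) for n in min_version)
        problems.append(f"Python {wanted}+ required, found {found}")
    # The install script expects a path to an existing OSWorld checkout.
    if not Path(osworld_dir).is_dir():
        problems.append(f"OSWorld checkout not found at {osworld_dir}")
    return problems
```

Calling check_prerequisites("/path/to/OSWorld") with the real checkout path and printing any returned messages gives a quick preflight before the integration step.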
Maintenance & Community
The project encourages community contributions of new skills in domains such as autonomous driving, robotics, and mobile agents. Submissions can be made through the project website or GitHub issues, and contributed skills are curated and normalized before being added to the public library.
Licensing & Compatibility
MMSkills is released under the Apache License 2.0, which is permissive for commercial use and integration into closed-source projects. Some OSWorld integration components are derived from OSWorld itself.
Limitations & Caveats
Full functionality depends on integration with the OSWorld framework. The current public release covers a "compact multimodal desktop-skill subset" and is not a complete fork of OSWorld. The model-agnostic interfaces require explicit configuration of API endpoints and keys.