OmniParser by microsoft

Screen parsing tool for vision-based GUI agents

Created 1 year ago
24,805 stars

Top 1.9% on SourcePulse

Project Summary

OmniParser parses UI screenshots into structured elements so that vision-based GUI agents, such as those built on GPT-4V, can ground their actions in specific interface regions. It targets developers building computer-use agents and aims to improve action generation and interaction with the interface.
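As a rough illustration of what "grounding an action in a region" can mean, the sketch below turns a parsed element into a pixel click target. The `ParsedElement` schema, the normalized `(x1, y1, x2, y2)` box format, and the helper names are assumptions made for this example, not OmniParser's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedElement:
    """One UI element from a parsed screenshot (hypothetical schema)."""
    label: str                               # caption or recognized text
    bbox: tuple[float, float, float, float]  # assumed normalized (x1, y1, x2, y2)
    interactable: bool


def click_point(element: ParsedElement, width: int, height: int) -> tuple[int, int]:
    """Return the pixel center of the element's box for a screenshot of this size."""
    x1, y1, x2, y2 = element.bbox
    return round((x1 + x2) / 2 * width), round((y1 + y2) / 2 * height)


def find_target(elements, query):
    """Pick the first interactable element whose label contains the query."""
    for el in elements:
        if el.interactable and query.lower() in el.label.lower():
            return el
    return None


# Hand-written sample data standing in for real parser output.
elements = [
    ParsedElement("Welcome back", (0.10, 0.20, 0.40, 0.25), False),
    ParsedElement("Settings gear icon", (0.5, 0.25, 0.75, 0.5), True),
]
target = find_target(elements, "settings")
point = click_point(target, width=1000, height=800)
```

An agent would pass `point` to whatever mouse-control layer it uses; filtering on `interactable` keeps it from clicking static text.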

How It Works

OmniParser uses a two-stage approach. First, an interactable-region detection model locates UI elements; second, an icon functional description model captions each detected element. Together these stages give fine-grained parsing, covering small icons and predicting whether each element is interactable, which precise agent control depends on.
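The two stages described above can be sketched as a small pipeline. This is a minimal outline under stated assumptions: `parse_screen`, `Detection`, `Element`, and the stand-in `fake_detect` and `fake_caption` functions are hypothetical names used so the example runs without model weights. They are not the project's API.

```python
from dataclasses import dataclass
from typing import Callable

Box = tuple[int, int, int, int]  # assumed pixel box: (x1, y1, x2, y2)


@dataclass
class Detection:
    box: Box
    interactable: bool


@dataclass
class Element:
    box: Box
    interactable: bool
    description: str


def parse_screen(image, detect: Callable, caption: Callable) -> list[Element]:
    """Stage 1 finds regions; stage 2 describes each region."""
    elements = []
    for det in detect(image):
        elements.append(Element(det.box, det.interactable, caption(image, det.box)))
    return elements


# Stand-in models (hypothetical): a real setup would call the detection
# and captioning checkpoints here instead.
def fake_detect(image):
    return [Detection((10, 10, 42, 42), True), Detection((60, 10, 300, 40), False)]


def fake_caption(image, box):
    x1, _, x2, _ = box
    return "icon" if (x2 - x1) <= 64 else "text region"


parsed = parse_screen(image=None, detect=fake_detect, caption=fake_caption)
```

Passing the two models in as callables keeps the stages independent, so either one could be swapped for a different checkpoint without touching the loop.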

Quick Start & Requirements

  • Install via pip install -r requirements.txt after cloning the repository.
  • Requires Python 3.12 and downloading V2 model weights from Hugging Face.
  • An official demo is available as a Hugging Face Space.
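Before running anything, it can help to confirm that the requirements above are met. The check below is a sketch only: the `weights` directory and the `icon_detect` and `icon_caption` folder names are assumptions for illustration, and the real layout should be taken from the repository README.

```python
import sys
import tempfile
from pathlib import Path

REQUIRED_PYTHON = (3, 12)                         # from the requirement listed above
WEIGHT_FOLDERS = ("icon_detect", "icon_caption")  # assumed folder names


def preflight(version, weights_dir):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if tuple(version[:2]) != REQUIRED_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} found, 3.12 expected")
    for name in WEIGHT_FOLDERS:
        if not (Path(weights_dir) / name).is_dir():
            problems.append(f"missing weights folder: {name}")
    return problems


# Demo against a temporary directory that holds only one of the two folders.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "icon_detect").mkdir()
    demo_problems = preflight((3, 11, 0), tmp)

# In a real checkout (hypothetical "weights" path):
current_problems = preflight(sys.version_info, "weights")
```

Returning a list rather than raising on the first failure lets a setup script report every missing piece in one pass.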

Highlighted Details

  • Achieves state-of-the-art results on the ScreenSpot-Pro grounding benchmark.
  • V1.5 adds fine-grained icon detection and interactability prediction.
  • OmniTool allows controlling Windows 11 VMs with OmniParser and various LLMs.
  • Supports local trajectory logging for agent training data pipelines.

Maintenance & Community

  • Active development with V2 checkpoints released in Feb 2025.
  • Project page and V2 blog post linked in the README.

Licensing & Compatibility

  • Model checkpoints carry different licenses: icon_detect is AGPL (inherited from YOLO), while the icon_caption models are MIT.
  • AGPL license may impose restrictions on commercial or closed-source use.

Limitations & Caveats

The AGPL license for the detection model may restrict its use in proprietary software. Documentation for new features like multi-agent orchestration is still in progress.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull requests (30d): 1
  • Issues (30d): 0
  • Star history: 154 stars in the last 30 days