GUI-Actor by microsoft

Coordinate-free visual grounding for GUI agents

Created 4 months ago
334 stars

Top 82.1% on SourcePulse

Project Summary

This repository introduces GUI-Actor, a novel approach for visual grounding in GUI agents that moves beyond traditional coordinate-generation methods. It targets researchers and developers building AI agents for automating GUI interactions, offering improved spatial-semantic alignment and a more human-like interaction paradigm.

How It Works

GUI-Actor augments a Vision-Language Model (VLM) with an action head that performs coordinate-free grounding: instead of regressing precise coordinates, it attends to the relevant visual regions, mimicking how humans perceive an interface. A single forward pass can yield multiple candidate regions, giving downstream decision-making more flexibility, and a grounding verifier module refines the final action region selection.
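The mechanism can be pictured as attention pooling over visual patches. The sketch below is illustrative only, with hypothetical names (candidate_regions, the action-token convention), and is not GUI-Actor's actual code: an action token's hidden state scores every patch, and the top-k patches become candidate regions in one forward pass.

```python
# Illustrative sketch of attention-based, coordinate-free grounding.
# All names are hypothetical; this is not GUI-Actor's actual API.
import torch

def candidate_regions(action_hidden, patch_features, k=3):
    """Score every visual patch against a dedicated action token and
    return the top-k patches as candidate action regions.

    action_hidden:  (d,)   hidden state of the action token
    patch_features: (n, d) visual patch embeddings from the VLM encoder
    """
    d = action_hidden.shape[0]
    # Scaled dot-product attention weights over all patches.
    weights = torch.softmax(patch_features @ action_hidden / d**0.5, dim=0)
    # Several ranked candidates fall out of a single forward pass.
    topk = torch.topk(weights, k)
    return topk.indices, topk.values

# Toy usage: 64 patches with 128-dim features.
patches = torch.randn(64, 128)
action = torch.randn(128)
indices, weights = candidate_regions(action, patches)
print(indices.tolist(), weights.tolist())
```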

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a conda environment (conda create -n gui_actor python=3.10; conda activate gui_actor), install PyTorch with CUDA support, then install the package (pip install -e .); see the consolidated commands after this list.
  • Dependencies: PyTorch with CUDA support, Transformers, Hugging Face datasets.
  • Data: Requires downloading processed data and updating data_config.yaml.
  • Hardware: Requires a CUDA-enabled GPU for training and inference.
  • Links: Project Page (implied by author links), Hugging Face Models
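The install steps from the bullets above, consolidated into shell commands. The repository URL and the exact PyTorch command are assumptions; pick the torch build that matches your CUDA toolkit (see pytorch.org).

```bash
# Assumed repository URL, inferred from the project and org names.
git clone https://github.com/microsoft/GUI-Actor.git
cd GUI-Actor
conda create -n gui_actor python=3.10
conda activate gui_actor
pip install torch torchvision   # choose the build matching your CUDA version
pip install -e .
```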

Highlighted Details

  • Achieves state-of-the-art performance on GUI action grounding benchmarks, surpassing UI-TARS-72B on ScreenSpot-Pro with a Qwen2.5-VL backbone (44.6 vs. 38.1).
  • Demonstrates generalization to unseen screen resolutions and layouts.
  • The grounding verifier can be integrated with other grounding methods for performance boosts (a reranking sketch follows this list).
  • Supports Qwen2-VL and Qwen2.5-VL backbones, with 2B, 3B, and 7B parameter variants.
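As a sketch of that integration point, the snippet below shows one way a verifier could rerank candidates from any grounder. None of these names come from the GUI-Actor codebase; it is a minimal sketch assuming the verifier exposes a per-region score.

```python
# Hypothetical sketch of placing a grounding verifier behind any
# grounder that emits candidate regions; all names are illustrative.
from typing import Callable, Sequence, Tuple

Region = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

def rerank_with_verifier(
    candidates: Sequence[Region],
    verifier_score: Callable[[Region], float],
) -> Region:
    """Return the candidate the verifier rates as most plausible.

    In practice, verifier_score would crop the screenshot to the
    region and ask the verifier model how well the crop matches the
    instruction; here it is just any callable returning a score.
    """
    return max(candidates, key=verifier_score)

# Toy usage with a stand-in scorer that prefers smaller regions.
cands = [(0, 0, 100, 40), (10, 10, 60, 30)]
score = lambda r: -((r[2] - r[0]) * (r[3] - r[1]))
print(rerank_with_verifier(cands, score))
```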

Maintenance & Community

The project is primarily associated with Microsoft Research and Nanjing University. The README lists the key contributors, indicates the project leads, and links to the project page and the Hugging Face models.

Licensing & Compatibility

The README does not state a license, so suitability for commercial use or closed-source linking is unspecified.

Limitations & Caveats

The README indicates that several components, including the demo, processed training data, and full code releases, are still pending (as of June 2025). Development is active, with releases planned for additional features and model backbones.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 15 stars in the last 30 days
