GUI-Actor by microsoft

Coordinate-free visual grounding for GUI agents

Created 4 months ago
334 stars

Top 82.1% on SourcePulse

Project Summary

This repository introduces GUI-Actor, a novel approach for visual grounding in GUI agents that moves beyond traditional coordinate-generation methods. It targets researchers and developers building AI agents for automating GUI interactions, offering improved spatial-semantic alignment and a more human-like interaction paradigm.

How It Works

GUI-Actor augments a Vision-Language Model (VLM) with an action head that performs coordinate-free grounding: instead of regressing precise coordinates, it attends to the relevant visual regions, mimicking how humans perceive an interface. A single forward pass can yield multiple candidate regions, giving downstream decision-making more flexibility, and a grounding verifier module refines the final action region selection.
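The mechanism can be pictured as attention pooling over visual patches. The sketch below is illustrative only, with hypothetical names (candidate_regions, the action-token convention), and is not GUI-Actor's actual code: an action token's hidden state scores every patch, and the top-k patches become candidate regions in one forward pass.

```python
# Illustrative sketch of attention-based, coordinate-free grounding.
# All names are hypothetical; this is not GUI-Actor's actual API.
import torch

def candidate_regions(action_hidden, patch_features, k=3):
    """Score every visual patch against a dedicated action token and
    return the top-k patches as candidate action regions.

    action_hidden:  (d,)   hidden state of the action token
    patch_features: (n, d) visual patch embeddings from the VLM encoder
    """
    d = action_hidden.shape[0]
    # Scaled dot-product attention weights over all patches.
    weights = torch.softmax(patch_features @ action_hidden / d**0.5, dim=0)
    # Several ranked candidates fall out of a single forward pass.
    topk = torch.topk(weights, k)
    return topk.indices, topk.values

# Toy usage: 64 patches with 128-dim features.
patches = torch.randn(64, 128)
action = torch.randn(128)
indices, weights = candidate_regions(action, patches)
print(indices.tolist(), weights.tolist())
```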

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a conda environment (conda create -n gui_actor python=3.10; conda activate gui_actor), install PyTorch with CUDA support, then install the package (pip install -e .); see the consolidated commands after this list.
  • Dependencies: PyTorch with CUDA support, Transformers, Hugging Face datasets.
  • Data: Requires downloading processed data and updating data_config.yaml.
  • Hardware: Requires a CUDA-enabled GPU for training and inference.
  • Links: Project Page (implied by author links), Hugging Face Models
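The install steps from the bullets above, consolidated into shell commands. The repository URL and the exact PyTorch command are assumptions; pick the torch build that matches your CUDA toolkit (see pytorch.org).

```bash
# Assumed repository URL, inferred from the project and org names.
git clone https://github.com/microsoft/GUI-Actor.git
cd GUI-Actor
conda create -n gui_actor python=3.10
conda activate gui_actor
pip install torch torchvision   # choose the build matching your CUDA version
pip install -e .
```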

Highlighted Details

  • Achieves state-of-the-art performance on GUI action grounding benchmarks, surpassing UI-TARS-72B on ScreenSpot-Pro with a Qwen2.5-VL backbone (44.6 vs. 38.1).
  • Demonstrates generalization to unseen screen resolutions and layouts.
  • The grounding verifier can be integrated with other grounding methods for performance boosts (a reranking sketch follows this list).
  • Supports Qwen2-VL and Qwen2.5-VL backbones, with 2B, 3B, and 7B parameter variants.
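As a sketch of that integration point, the snippet below shows one way a verifier could rerank candidates from any grounder. None of these names come from the GUI-Actor codebase; it is a minimal sketch assuming the verifier exposes a per-region score.

```python
# Hypothetical sketch of placing a grounding verifier behind any
# grounder that emits candidate regions; all names are illustrative.
from typing import Callable, Sequence, Tuple

Region = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

def rerank_with_verifier(
    candidates: Sequence[Region],
    verifier_score: Callable[[Region], float],
) -> Region:
    """Return the candidate the verifier rates as most plausible.

    In practice, verifier_score would crop the screenshot to the
    region and ask the verifier model how well the crop matches the
    instruction; here it is just any callable returning a score.
    """
    return max(candidates, key=verifier_score)

# Toy usage with a stand-in scorer that prefers smaller regions.
cands = [(0, 0, 100, 40), (10, 10, 60, 30)]
score = lambda r: -((r[2] - r[0]) * (r[3] - r[1]))
print(rerank_with_verifier(cands, score))
```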

Maintenance & Community

The project is primarily associated with Microsoft Research and Nanjing University. The README lists the key contributors, indicates the project leads, and links to the project page and the Hugging Face models.

Licensing & Compatibility

The README does not state a license, so suitability for commercial use or closed-source linking is unspecified.

Limitations & Caveats

The README indicates that several components, including the demo, processed training data, and full code releases, are still pending (as of June 2025). Development is active, with releases planned for additional features and model backbones.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 15 stars in the last 30 days
