Coordinate-free visual grounding for GUI agents
Top 82.1% on SourcePulse
This repository introduces GUI-Actor, a novel approach for visual grounding in GUI agents that moves beyond traditional coordinate-generation methods. It targets researchers and developers building AI agents for automating GUI interactions, offering improved spatial-semantic alignment and a more human-like interaction paradigm.
How It Works
GUI-Actor uses a vision-language model (VLM) extended with an action head. The head performs coordinate-free grounding by attending to the relevant visual regions, mimicking how a person looks at a target rather than computing precise coordinates. Because the output is a distribution over regions, the model can propose multiple candidate regions in a single forward pass, which gives downstream decision-making more flexibility. A grounding verifier module is also included to refine the choice of action region.
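The attention-based selection described above can be illustrated with a small sketch. This is not GUI-Actor's code: the function names, the patch grid, and the use of plain scaled dot-product attention are all assumptions made for illustration. It shows how a single query vector (standing in for the action head's state) can score every image patch at once and return several candidate regions instead of one generated coordinate.

```python
import numpy as np

# Hypothetical sketch: names and shapes are illustrative, not GUI-Actor's API.

def attention_over_patches(action_query, patch_embeddings):
    """Softmax of scaled dot-product scores between one query and all patches."""
    dim = action_query.shape[-1]
    scores = patch_embeddings @ action_query / np.sqrt(dim)
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def top_k_regions(weights, grid_hw, k=3):
    """Return the k highest-weighted patches as (row, col, weight) tuples."""
    _, width = grid_hw
    order = np.argsort(weights)[::-1][:k]
    return [(int(i // width), int(i % width), float(weights[i])) for i in order]

def patch_center(row, col, grid_hw):
    """Normalised (x, y) centre of a patch, each in the range [0, 1]."""
    height, width = grid_hw
    return ((col + 0.5) / width, (row + 0.5) / height)

# Toy input: a 3x4 grid of patches with 16-dimensional embeddings.
grid_hw = (3, 4)
patch_embeddings = np.eye(12, 16)
# Assumed query: aligned with patch 7, so that patch should score highest.
action_query = 5.0 * patch_embeddings[7]

weights = attention_over_patches(action_query, patch_embeddings)
candidates = top_k_regions(weights, grid_hw, k=3)
best_row, best_col, _ = candidates[0]
click_xy = patch_center(best_row, best_col, grid_hw)
```

In this toy setup the candidate list plays the role of the multiple regions produced in one pass; a verifier step, as mentioned above, would then choose among them.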
Quick Start & Requirements
Create a conda environment (conda create -n gui_actor python=3.10), activate it (conda activate gui_actor), install PyTorch with CUDA support, and then install the package (pip install -e .).
The setup also references a data_config.yaml file.
Highlighted Details
Maintenance & Community
The project is primarily associated with Microsoft Research and Nanjing University. Key contributors are listed, with leadership indicated. The project page and Hugging Face models are linked.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README indicates that several components, including the demo, the processed training data, and the full code release, are still pending (as of June 2025). The project is under active development, with releases planned for additional features and model support.