GUI-G2 by ZJU-REAL

Gaussian reward modeling for precise GUI grounding

Created 7 months ago

302 stars

Top 88.6% on SourcePulse

Project Summary

GUI-G² introduces a novel Gaussian reward modeling framework for training models to perform GUI grounding tasks. It addresses the limitations of traditional reinforcement learning rewards by mimicking human interaction patterns, specifically the Gaussian-like spatial distributions of clicks around targets. This approach offers a more precise and robust method for training models to accurately identify and interact with GUI elements, benefiting researchers and developers working on human-computer interaction, visual language models, and automated UI agents.

How It Works

GUI-G² employs a Gaussian reward framework inspired by human click behavior observed in datasets like AITW. The core innovation lies in its reward functions: Gaussian Point Reward, which rewards proximity to target centers, and Gaussian Coverage Reward, which encourages spatial alignment with the target area. An Adaptive Variance Mechanism dynamically adjusts the reward granularity based on the GUI element's scale. This dense reward signal provides smoother gradients compared to sparse, binary RL rewards, leading to more efficient and effective early-stage learning.
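
The two rewards and the adaptive variance described above can be illustrated with a minimal sketch. This is not the project's implementation. It assumes that boxes are axis-aligned (x1, y1, x2, y2) tuples, that the standard deviation is a fixed fraction alpha of the element's width and height, that coverage is measured by the Bhattacharyya coefficient between the predicted and target Gaussians, and that the two terms are combined by a weighted sum. The names Box, gaussian_params, point_reward, coverage_reward, and combined_reward, and all default values, are hypothetical.

```python
import math
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixels (assumed non-degenerate)."""
    x1: float
    y1: float
    x2: float
    y2: float


def gaussian_params(box: Box, alpha: float = 0.5) -> Tuple[float, float, float, float]:
    """Return (cx, cy, sigma_x, sigma_y) for a box.

    Adaptive variance (assumption): sigma scales with the element's size,
    so large elements tolerate larger offsets than small ones.
    """
    cx = (box.x1 + box.x2) / 2.0
    cy = (box.y1 + box.y2) / 2.0
    sigma_x = max(alpha * (box.x2 - box.x1), 1e-6)  # guard against zero width
    sigma_y = max(alpha * (box.y2 - box.y1), 1e-6)  # guard against zero height
    return cx, cy, sigma_x, sigma_y


def point_reward(pred: Box, target: Box, alpha: float = 0.5) -> float:
    """Gaussian score of the predicted center under the target's Gaussian."""
    px, py, _, _ = gaussian_params(pred, alpha)
    cx, cy, sx, sy = gaussian_params(target, alpha)
    return math.exp(-0.5 * (((px - cx) / sx) ** 2 + ((py - cy) / sy) ** 2))


def coverage_reward(pred: Box, target: Box, alpha: float = 0.5) -> float:
    """Overlap of predicted and target Gaussians (Bhattacharyya coefficient).

    For diagonal covariances the 2D coefficient is the product of the
    per-axis 1D coefficients.
    """
    pcx, pcy, psx, psy = gaussian_params(pred, alpha)
    tcx, tcy, tsx, tsy = gaussian_params(target, alpha)
    coeff = 1.0
    for mu_p, mu_t, s_p, s_t in ((pcx, tcx, psx, tsx), (pcy, tcy, psy, tsy)):
        var_sum = s_p ** 2 + s_t ** 2
        coeff *= math.sqrt(2.0 * s_p * s_t / var_sum)
        coeff *= math.exp(-((mu_p - mu_t) ** 2) / (4.0 * var_sum))
    return coeff


def combined_reward(pred: Box, target: Box, alpha: float = 0.5,
                    w_point: float = 1.0, w_cov: float = 1.0) -> float:
    """Weighted sum of both terms; the weights are illustrative only."""
    return (w_point * point_reward(pred, target, alpha)
            + w_cov * coverage_reward(pred, target, alpha))
```

Under these assumptions, a prediction identical to the target scores 1.0 on both terms, and the same pixel offset costs less on a large element than on a small one because sigma grows with the element's size.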

Quick Start & Requirements

Installation: Create and activate a conda environment with Python 3.10 (conda create -n gui-g2 python=3.10, then conda activate gui-g2), and run bash setup.sh. For manual dependency installation, the pinned versions include transformers==4.49.0 and deepspeed==0.15.4.
Prerequisites: Python 3.10, transformers, and deepspeed. A CUDA-capable GPU is likely needed for practical inference and training, since the example code uses device_map="cuda".
Models: Pre-trained GUI-G2-3B and GUI-G2-7B models are available on Hugging Face; the README provides download commands.
Links: Project Page: https://zju-real.github.io/GUI-G2, Code: https://github.com/zju-real/GUI-G2, Paper: https://arxiv.org/abs/2507.15846.

Highlighted Details

Reports state-of-the-art results on the ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks.
Offers pre-trained models in 3B and 7B parameter sizes.
The Gaussian reward mechanism provides dense learning signals, improving gradient smoothness over binary RL rewards.
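
The difference between a sparse binary reward and a dense Gaussian one can be seen by scoring clicks at increasing distances from a target. The sketch below is illustrative only: the box format (x1, y1, x2, y2), the alpha scale factor, and the function names binary_reward, gaussian_reward, and reward_profile are assumptions, not the project's API.

```python
import math
from typing import List, Sequence, Tuple

BoxT = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed non-degenerate


def binary_reward(px: float, py: float, box: BoxT) -> float:
    """Sparse reward: 1 if the click lands inside the box, else 0."""
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= px <= x2 and y1 <= py <= y2 else 0.0


def gaussian_reward(px: float, py: float, box: BoxT, alpha: float = 0.5) -> float:
    """Dense reward: decays smoothly with distance from the box center."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx, sy = alpha * (x2 - x1), alpha * (y2 - y1)  # size-dependent spread (assumption)
    return math.exp(-0.5 * (((px - cx) / sx) ** 2 + ((py - cy) / sy) ** 2))


def reward_profile(box: BoxT, offsets: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (offset, binary, gaussian) rows for clicks shifted right of the center."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return [(dx, binary_reward(cx + dx, cy, box), gaussian_reward(cx + dx, cy, box))
            for dx in offsets]


if __name__ == "__main__":
    for dx, sparse, dense in reward_profile((100, 100, 200, 140), [0, 40, 80, 160]):
        print(f"offset={dx:>4}px  binary={sparse:.1f}  gaussian={dense:.4f}")
```

Once a click falls outside the box, the binary reward is 0 regardless of how far off it is, so two misses at different distances are indistinguishable. The Gaussian reward still ranks the nearer miss higher, which is the sense in which it supplies a learning signal where the binary reward does not.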

Maintenance & Community

The project announced its paper acceptance to AAAI 2026 in November 2025 and open-sourced its 3B and 7B models in August 2025, following the paper release in July 2025. The primary community and code repository is hosted on GitHub.

Licensing & Compatibility

The README does not specify a software license. Without explicit licensing terms, suitability for commercial use or closed-source integration cannot be evaluated.

Limitations & Caveats

The README lists evaluation checkpoints as "will be released soon," so the evaluation setup may not yet be finalized. As a recent research contribution (accepted to AAAI 2026), the project may still be evolving.

