UGround  by OSU-NLP-Group

GUI visual grounding for GUI agents

created 1 year ago
262 stars

Top 97.8% on sourcepulse

Project Summary

UGround provides a universal visual grounding solution for GUI agents, enabling them to accurately identify and interact with elements on screen. This project is targeted at researchers and developers building AI agents for tasks involving graphical user interfaces, offering state-of-the-art performance and a comprehensive evaluation suite.

How It Works

UGround leverages large multimodal models (LMMs), specifically fine-tuning Qwen2-VL, to perform visual grounding. The approach involves training the model on diverse GUI datasets to understand the spatial relationships between textual descriptions and visual elements within interfaces. This fine-tuning allows the model to output precise coordinates (x, y) for requested elements, facilitating agent navigation and interaction.
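Since the model's answer arrives as text, an agent still has to extract and scale the coordinate pair before it can click. The helper below is a hypothetical sketch (not from the UGround codebase) that parses an "(x, y)" pair from a model response and maps it to pixel space, assuming a 0-1000 normalized coordinate convention; check the model card for the actual output format.

```python
import re

def parse_click_point(model_output: str, image_width: int, image_height: int):
    """Extract an "(x, y)" pair from model output and scale to pixels.

    Assumes coordinates are normalized to a 0-1000 range; this is an
    illustrative assumption, not a documented UGround guarantee.
    """
    match = re.search(r"\((\d+),\s*(\d+)\)", model_output)
    if match is None:
        return None  # no coordinate pair found in the response
    x, y = int(match.group(1)), int(match.group(2))
    return (round(x / 1000 * image_width), round(y / 1000 * image_height))

print(parse_click_point("(500, 250)", 1920, 1080))  # -> (960, 270)
```

A real agent loop would pass the returned pixel coordinates to its input backend (e.g. a mouse-event API) to perform the click.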

Quick Start & Requirements

  • Installation: Install dependencies via pip: pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830, pip install accelerate, pip install qwen-vl-utils, and pip install 'vllm==0.6.1'.
  • Prerequisites: Requires Python and vLLM for efficient inference.
  • Resources: Model weights for Qwen2-VL-based UGround-V1 (2B, 7B, 72B) are available.
  • Demo: A live demo is available on Hugging Face Spaces.
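Collected into one setup script, the installation commands above (versions and the pinned transformers commit taken verbatim from the summary) look like this:

```shell
# Pinned transformers commit and vLLM version, as listed in the Quick Start
pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
pip install accelerate
pip install qwen-vl-utils
pip install 'vllm==0.6.1'
```

Installing into a fresh virtual environment is advisable, since the pinned transformers commit and vllm==0.6.1 can conflict with newer versions already present.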

Highlighted Details

  • Achieves state-of-the-art results on the ScreenSpot-Pro benchmark, raising the prior best score from 18.9 to 31.1.
  • Qwen2-VL-based UGround-V1 models (2B, 7B, 72B) are released.
  • Includes a comprehensive evaluation suite for GUI agents and grounding models.
  • Training data for UGround-V1 series has been released.

Maintenance & Community

This project is a collaboration between OSU NLP Group and Orby AI. Further details on community channels or roadmaps are not explicitly provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. The project cites papers from ICLR and ICML, suggesting a research-oriented release. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README mentions that the data synthesis pipeline is "Coming Soon," indicating it is not yet available. The inference code relies on specific versions of libraries like transformers and vLLM, which may require careful environment management.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 49 stars in the last 90 days

Explore Similar Projects

Starred by Elie Bursztein (Cybersecurity Lead at Google DeepMind), Joe Walnes (Head of Experimental Projects at Stripe), and 2 more.

OmniParser by microsoft

Screen parsing tool for vision-based GUI agents
Top 0.3% on sourcepulse, 23k stars
created 10 months ago, updated 4 months ago