UGround  by OSU-NLP-Group

GUI visual grounding for GUI agents

created 1 year ago
262 stars

Top 97.8% on sourcepulse

Project Summary

UGround provides a universal visual grounding solution for GUI agents, enabling them to accurately identify and interact with elements on screen. This project is targeted at researchers and developers building AI agents for tasks involving graphical user interfaces, offering state-of-the-art performance and a comprehensive evaluation suite.

How It Works

UGround leverages large multimodal models (LMMs), specifically fine-tuning Qwen2-VL, to perform visual grounding. The approach involves training the model on diverse GUI datasets to understand the spatial relationships between textual descriptions and visual elements within interfaces. This fine-tuning allows the model to output precise coordinates (x, y) for requested elements, facilitating agent navigation and interaction.
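Since the model's answer arrives as text, an agent still has to extract and scale the coordinate pair before it can click. The helper below is a hypothetical sketch (not from the UGround codebase) that parses an "(x, y)" pair from a model response and maps it to pixel space, assuming a 0-1000 normalized coordinate convention; check the model card for the actual output format.

```python
import re

def parse_click_point(model_output: str, image_width: int, image_height: int):
    """Extract an "(x, y)" pair from model output and scale to pixels.

    Assumes coordinates are normalized to a 0-1000 range; this is an
    illustrative assumption, not a documented UGround guarantee.
    """
    match = re.search(r"\((\d+),\s*(\d+)\)", model_output)
    if match is None:
        return None  # no coordinate pair found in the response
    x, y = int(match.group(1)), int(match.group(2))
    return (round(x / 1000 * image_width), round(y / 1000 * image_height))

print(parse_click_point("(500, 250)", 1920, 1080))  # -> (960, 270)
```

A real agent loop would pass the returned pixel coordinates to its input backend (e.g. a mouse-event API) to perform the click.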

Quick Start & Requirements

  • Installation: Install dependencies via pip: pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830, pip install accelerate, pip install qwen-vl-utils, and pip install 'vllm==0.6.1'.
  • Prerequisites: Requires Python and vLLM for efficient inference.
  • Resources: Model weights for Qwen2-VL-based UGround-V1 (2B, 7B, 72B) are available.
  • Demo: A live demo is available on Hugging Face Spaces.
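Collected into one setup script, the installation commands above (versions and the pinned transformers commit taken verbatim from the summary) look like this:

```shell
# Pinned transformers commit and vLLM version, as listed in the Quick Start
pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830
pip install accelerate
pip install qwen-vl-utils
pip install 'vllm==0.6.1'
```

Installing into a fresh virtual environment is advisable, since the pinned transformers commit and vllm==0.6.1 can conflict with newer versions already present.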

Highlighted Details

  • Achieves state-of-the-art results on the ScreenSpot-Pro benchmark, raising the prior best score from 18.9 to 31.1.
  • Qwen2-VL-based UGround-V1 models (2B, 7B, 72B) are released.
  • Includes a comprehensive evaluation suite for GUI agents and grounding models.
  • Training data for UGround-V1 series has been released.

Maintenance & Community

This project is a collaboration between OSU NLP Group and Orby AI. Further details on community channels or roadmaps are not explicitly provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. The project cites papers from ICLR and ICML, suggesting a research-oriented release. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README mentions that the data synthesis pipeline is "Coming Soon," indicating it is not yet available. The inference code relies on specific versions of libraries like transformers and vLLM, which may require careful environment management.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 49 stars in the last 90 days

Explore Similar Projects

Starred by Elie Bursztein (Cybersecurity Lead at Google DeepMind), Joe Walnes (Head of Experimental Projects at Stripe), and 2 more.

OmniParser by microsoft

Screen parsing tool for vision-based GUI agents
Top 0.3% on sourcepulse, 23k stars
created 10 months ago, updated 4 months ago