ScreenSpot-Pro-GUI-Grounding by likaixin2000

GUI grounding for professional high-resolution computer interaction

Created 9 months ago
253 stars

Top 99.4% on SourcePulse

Project Summary

ScreenSpot-Pro-GUI-Grounding addresses the challenge of precise GUI grounding in professional, high-resolution computer environments. It introduces the SE-GUI model, offering enhanced accuracy in understanding and interacting with graphical user interfaces. The project benefits researchers and developers in UI automation, multimodal AI, and large language model applications who need robust GUI comprehension.

How It Works

The project's core contribution is the SE-GUI model, which achieves 47.2% accuracy with its 7B-parameter version and 35.9% with its 3B version, trained on a dataset of only 3,000 open-source samples. Through ScreenSpot-v2-variants, it supports diverse interaction paradigms: original instructions, action-based commands, target UI descriptions, and negative instructions, enabling flexible and nuanced GUI control.

Quick Start & Requirements

  • Setup: Requires setting the OPENAI_API_KEY environment variable.
  • Evaluation: Evaluation can be initiated using provided shell scripts, such as run_ss_pro.sh.
  • Prerequisites: An OpenAI API key is mandatory. No other hardware or software dependencies are detailed in the README.
  • Documentation: An arXiv paper is referenced for further details, though no direct URL is supplied.
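The setup and evaluation steps above can be sketched as a short shell session. The script name `run_ss_pro.sh` comes from the repository; the key value below is a placeholder, not a real credential:

```shell
# Required: the evaluation scripts call the OpenAI API.
export OPENAI_API_KEY="your-key-here"  # placeholder; substitute your own key

# From the repository root, launch the ScreenSpot-Pro evaluation
# (uncomment inside the cloned repo):
# bash run_ss_pro.sh
```

Other evaluation scripts in the repository can presumably be invoked the same way.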

Highlighted Details

  • The SE-GUI model demonstrates strong performance, achieving 47.2% accuracy with a 7B model and 35.9% with a 3B model, trained on a modest 3k sample dataset.
  • The ScreenSpot-Pro benchmark is used to evaluate several prominent AI projects, including Omniparser v2, Qwen2.5-VL, UI-TARS, UGround, and AGUVIS.
  • Offers flexibility through ScreenSpot-v2-variants, supporting multiple instruction styles like original, action, target UI description, and negative instructions for varied user interaction needs.

Maintenance & Community

The README does not mention project maintainers, community channels (such as Discord or Slack), a roadmap, or notable contributors.

Licensing & Compatibility

No license is specified in the README; adopters should verify licensing before making adoption decisions.

Limitations & Caveats

The README does not explicitly state any limitations or known issues. However, the relatively small training dataset size (3,000 samples) for the SE-GUI model might warrant consideration regarding its generalization capabilities across a wider range of GUIs and scenarios.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 9 stars in the last 30 days

Explore Similar Projects

Starred by Eric Zhu (coauthor of AutoGen; Research Scientist at Microsoft Research), Yaowei Zheng (author of LLaMA-Factory), and 2 more.

UI-TARS-desktop by bytedance

  • 1.1% · 19k stars
  • GUI agent app for computer control via natural language
  • Created 8 months ago · Updated 16 hours ago