SeeClick  by njucckevin

Visual GUI agent for grounding and interacting with graphical user interfaces

created 1 year ago
406 stars

Project Summary

SeeClick provides the model, data, and code for a visual GUI agent built around GUI grounding, that is, locating on-screen elements from natural language instructions. It targets researchers and developers working on multimodal AI agents that need to understand and interact with graphical user interfaces across platforms. The project offers a benchmark, pre-training data, and inference code to support the development of such agents.

How It Works

SeeClick is built on Qwen-VL, a large vision-language model, and fine-tunes it on a large-scale GUI grounding dataset. This fine-tuning teaches the model to locate specific UI elements (text or icons/widgets) from natural language instructions, so that it can predict click points or bounding boxes within interface screenshots.
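To make the click-point and bounding-box outputs concrete, here is a minimal sketch of turning a model response into a pixel coordinate. It assumes (this is not stated in the summary above) that the model answers with a text string such as "(x, y)" or "(left, top, right, bottom)" whose values are ratios in [0, 1] relative to the screenshot's width and height. The helper name `to_click_point` is hypothetical.

```python
import re
from typing import Tuple

# Matches plain decimal numbers such as 0.5 or 1
_NUM = re.compile(r"-?\d+(?:\.\d+)?")


def to_click_point(response: str, width: int, height: int) -> Tuple[int, int]:
    """Convert a normalized point or box string into a pixel click point.

    Assumption: values are ratios in [0, 1]; a box is clicked at its center.
    """
    values = [float(v) for v in _NUM.findall(response)]
    if len(values) == 2:
        # Point form: (x, y)
        x, y = values
    elif len(values) == 4:
        # Box form: (left, top, right, bottom); use the center as the click target
        left, top, right, bottom = values
        x, y = (left + right) / 2, (top + bottom) / 2
    else:
        raise ValueError(f"expected 2 or 4 numbers, got {len(values)}: {response!r}")
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError(f"coordinates outside [0, 1]: {response!r}")
    # Scale the ratios to the screenshot's pixel dimensions
    return round(x * width), round(y * height)


if __name__ == "__main__":
    print(to_click_point("(0.5,0.25)", 1000, 800))  # (500, 200)
```

Clicking the center of a predicted box is one simple choice; an agent could just as well keep the full box for cropping or verification.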

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Inference requires a CUDA-enabled GPU.
  • Model checkpoint available on Hugging Face: cckevinn/SeeClick
  • Official documentation and examples are available in the repository.
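As a rough illustration of what inference could look like, the sketch below loads the checkpoint through the Hugging Face `transformers` remote-code interface that Qwen-VL-Chat exposes (`from_list_format`, `model.chat`). Several details are assumptions rather than facts taken from this summary: that the tokenizer and generation config come from `Qwen/Qwen-VL-Chat`, that the checkpoint loads with these arguments, and the exact prompt wording. The function names are hypothetical; check the repository's own examples before relying on any of it.

```python
# Assumed prompt template; the wording the model was trained with may differ.
PROMPT = (
    "In this UI screenshot, what is the position of the element "
    'corresponding to the command "{}" (with point)?'
)


def build_query_items(image_path: str, instruction: str) -> list:
    """Build the image + text items in Qwen-VL's list format."""
    return [{"image": image_path}, {"text": PROMPT.format(instruction)}]


def locate(image_path: str, instruction: str, checkpoint: str = "cckevinn/SeeClick") -> str:
    """Ask the model where to click; returns the raw text response.

    Requires a CUDA-enabled GPU and downloads model weights on first use.
    """
    # Imported lazily so the prompt helper works without transformers installed
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers.generation import GenerationConfig

    # Assumption: the Qwen-VL-Chat tokenizer and generation config are compatible
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint, device_map="cuda", trust_remote_code=True, bf16=True
    ).eval()
    model.generation_config = GenerationConfig.from_pretrained(
        "Qwen/Qwen-VL-Chat", trust_remote_code=True
    )

    query = tokenizer.from_list_format(build_query_items(image_path, instruction))
    response, _history = model.chat(tokenizer, query=query, history=None)
    return response
```

A call such as `locate("screenshot.png", "add an event")` would return the model's text answer; loading the model once and reusing it across calls is the obvious refinement for anything beyond a quick test.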

Highlighted Details

  • Introduces ScreenSpot, a GUI grounding benchmark with over 1200 instructions across iOS, Android, macOS, Windows, and Web.
  • Reports state-of-the-art GUI grounding results, with higher average scores than models such as GPT-4V and CogAgent.
  • Provides a large-scale GUI grounding pre-training dataset, including a corpus from Common Crawl.
  • Offers code for pre-training, fine-tuning (including LoRA), and evaluation on downstream agent tasks.
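A grounding benchmark of this kind is commonly scored by click accuracy: a prediction counts as correct when the predicted point falls inside the target element's bounding box. The sketch below shows that computation under the assumption that ScreenSpot is scored this way and that points and boxes share one coordinate system; it is not the repository's evaluation script, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True when the point lies inside the box (edges inclusive)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(predictions: Iterable[Point], targets: Iterable[Box]) -> float:
    """Fraction of predicted points that land inside their target boxes."""
    pairs = list(zip(predictions, targets))
    if not pairs:
        raise ValueError("no prediction/target pairs to score")
    hits = sum(point_in_box(p, b) for p, b in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    preds = [(0.5, 0.5), (0.9, 0.1)]
    boxes = [(0.4, 0.4, 0.6, 0.6), (0.0, 0.0, 0.2, 0.2)]
    print(click_accuracy(preds, boxes))  # 0.5
```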

Maintenance & Community

  • The accompanying paper was published at ACL 2024.
  • Further details on downstream agent tasks and fine-tuning are available in the repository.

Licensing & Compatibility

  • The project incorporates datasets and checkpoints governed by their original licenses. Users must adhere to all specified terms.
  • Compatibility for commercial use or closed-source linking depends on the licenses of the incorporated datasets and checkpoints.

Limitations & Caveats

  • The project's licensing is complex due to the incorporation of multiple datasets and checkpoints, requiring careful review of each component's license.
  • SeeClick is trained primarily to predict click points, so its bounding box predictions may be less reliable.