SeeClick by njucckevin

Visual GUI agent for grounding and interacting with graphical user interfaces

Created 2 years ago

452 stars

Top 66.6% on SourcePulse

Project Summary

SeeClick provides the model, data, and code for a visual GUI agent that uses GUI grounding, the ability to locate on-screen elements from natural language instructions, to interact with interfaces. It is aimed at researchers and developers building multimodal agents that must understand and operate graphical user interfaces across platforms. The project includes a benchmark, pre-training data, and inference code.

How It Works

SeeClick is built on Qwen-VL, a large vision-language model, and extends it by fine-tuning on a large-scale GUI grounding dataset. This fine-tuning teaches the model to locate the UI element (text, icon, or widget) that a natural language instruction refers to, so that it can predict a click point or a bounding box within an interface screenshot.
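To make the idea of a predicted click point concrete, here is a minimal sketch of turning a model's text output into pixel coordinates on a screenshot. It assumes the model answers with a point written as two decimals in the range [0, 1], such as "(0.49, 0.40)", expressed as ratios of the image width and height; that output format and the helper names parse_click_point and to_pixels are assumptions for illustration, not part of the project's code.

```python
import re


def parse_click_point(model_output: str) -> tuple[float, float]:
    """Extract a normalized (x, y) point from text such as '(0.49, 0.40)'.

    Assumption: the output contains exactly two decimals in [0, 1].
    """
    numbers = [float(n) for n in re.findall(r"\d*\.?\d+", model_output)]
    if len(numbers) != 2:
        raise ValueError(f"expected two numbers, got {len(numbers)}: {model_output!r}")
    x, y = numbers
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError(f"point is not normalized to [0, 1]: {(x, y)}")
    return x, y


def to_pixels(point: tuple[float, float], width: int, height: int) -> tuple[int, int]:
    """Scale a normalized point to pixel coordinates of a width x height image."""
    x, y = point
    # Clamp so that a ratio of exactly 1.0 maps to the last valid pixel index.
    px = min(int(x * width), width - 1)
    py = min(int(y * height), height - 1)
    return px, py


if __name__ == "__main__":
    point = parse_click_point("(0.5, 0.25)")
    print(to_pixels(point, 1920, 1080))
```

Keeping the prediction normalized until the last step means the same output can be applied to screenshots of any resolution.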

Quick Start & Requirements

Install dependencies: pip install -r requirements.txt
Inference requires a CUDA-enabled GPU.
Model checkpoint available on Hugging Face: cckevinn/SeeClick
Official documentation and examples are available in the repository.

Highlighted Details

Introduces ScreenSpot, a GUI grounding benchmark with over 1200 instructions across iOS, Android, macOS, Windows, and Web.
Reports state-of-the-art GUI grounding performance at the time of publication, outperforming models such as GPT-4V and CogAgent on average.
Provides a large-scale GUI grounding pre-training dataset, including a corpus from Common Crawl.
Offers code for pre-training, fine-tuning (including LoRA), and evaluation on downstream agent tasks.
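A grounding benchmark of this kind is commonly scored by click accuracy: a prediction counts as correct when the predicted point falls inside the target element's bounding box. The sketch below illustrates that metric under stated assumptions; the Sample class, the field names, and the (left, top, right, bottom) box layout are hypothetical and do not reflect the repository's actual evaluation scripts or data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One hypothetical evaluation record, in normalized [0, 1] coordinates."""
    pred: tuple[float, float]               # predicted click point (x, y)
    box: tuple[float, float, float, float]  # target box (left, top, right, bottom)


def is_hit(pred: tuple[float, float], box: tuple[float, float, float, float]) -> bool:
    """Return True when the predicted point lies inside the box (edges included)."""
    x, y = pred
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(samples: list[Sample]) -> float:
    """Fraction of samples whose predicted point lands inside the target box."""
    if not samples:
        return 0.0
    hits = sum(is_hit(s.pred, s.box) for s in samples)
    return hits / len(samples)


if __name__ == "__main__":
    demo = [
        Sample(pred=(0.50, 0.50), box=(0.40, 0.40, 0.60, 0.60)),  # inside
        Sample(pred=(0.10, 0.90), box=(0.40, 0.40, 0.60, 0.60)),  # outside
    ]
    print(click_accuracy(demo))
```

Scoring on the point rather than on box overlap matches what an agent needs in practice, since a click anywhere inside the element triggers it.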

Maintenance & Community

The accompanying paper was published at ACL 2024.
Further details on downstream agent tasks and fine-tuning are available in the repository.

Licensing & Compatibility

The project incorporates datasets and checkpoints that remain governed by their original licenses, and users must comply with those terms.
Whether commercial use or closed-source linking is permitted depends on the licenses of those datasets and checkpoints.

Limitations & Caveats

Licensing is complex because the project combines multiple datasets and checkpoints; review each component's license before use.
SeeClick is trained primarily to predict click points, so its bounding box predictions may be less reliable.
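When a bounding box is available (for example, from annotations or from another detector) and only a click is needed, one simple option is to reduce the box to its center point. The following is a small sketch of that conversion, assuming boxes are given as (left, top, right, bottom); the function name box_center is hypothetical and this is not a step taken from the project itself.

```python
def box_center(box: tuple[float, float, float, float]) -> tuple[float, float]:
    """Return the center (x, y) of a (left, top, right, bottom) box."""
    left, top, right, bottom = box
    if left > right or top > bottom:
        raise ValueError(f"malformed box: {box}")
    return (left + right) / 2, (top + bottom) / 2


if __name__ == "__main__":
    print(box_center((0.25, 0.5, 0.75, 1.0)))
```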

Health Check

Last commit: 6 months ago
Responsiveness: 1 day
Pull requests (30d): 0
Issues (30d): 0
Star history: 7 stars in the last 30 days
