SeeClick provides the model, data, and code for a visual GUI agent that relies on GUI grounding, that is, locating the on-screen element an instruction refers to, in order to interact with interfaces. It is designed for researchers and developers building multimodal AI agents that must understand and operate graphical user interfaces across platforms. The project offers a benchmark, pre-training data, and inference code to support the development of such agents.
How It Works
SeeClick is built upon the Qwen-VL architecture, a large vision-language model. It enhances Qwen-VL's capabilities by fine-tuning it on a large-scale GUI grounding dataset. This fine-tuning process teaches the model to accurately identify and locate specific UI elements (text or icons/widgets) based on natural language instructions, enabling it to predict click points or bounding boxes within interface screenshots.
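To make the click-point output concrete, here is a minimal sketch of converting a predicted point into pixel coordinates. It assumes the model answers in text with a normalized point such as `(0.49, 0.40)`, with both values in [0, 1]. That output format and the helper names `parse_point` and `to_pixels` are illustrative assumptions, not something this summary confirms.

```python
import re
from typing import Optional, Tuple

# Assumed output format: two decimal numbers in parentheses, e.g. "(0.49, 0.40)".
_POINT_RE = re.compile(r"\(\s*([0-9]*\.?[0-9]+)\s*,\s*([0-9]*\.?[0-9]+)\s*\)")


def parse_point(response: str) -> Optional[Tuple[float, float]]:
    """Extract the first (x, y) pair from a model response, or return None."""
    match = _POINT_RE.search(response)
    if match is None:
        return None
    x, y = float(match.group(1)), float(match.group(2))
    # Reject values outside the assumed normalized range.
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        return None
    return x, y


def to_pixels(point: Tuple[float, float], width: int, height: int) -> Tuple[int, int]:
    """Scale a normalized point to pixel coordinates of a width x height screenshot."""
    x, y = point
    # Clamp so that a coordinate of exactly 1.0 maps to the last valid pixel.
    px = min(int(x * width), width - 1)
    py = min(int(y * height), height - 1)
    return px, py
```

For a 1920x1080 screenshot, `to_pixels((0.5, 0.25), 1920, 1080)` gives `(960, 270)`. Returning `None` for unparseable responses lets the caller decide whether to retry or count the sample as a miss.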
Quick Start & Requirements
- Install dependencies: `pip install -r requirements.txt`
- Inference requires a CUDA-enabled GPU.
- Model checkpoint available on Hugging Face: cckevinn/SeeClick
- Official documentation and examples are available in the repository.
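The steps above might come together roughly as follows. This is a sketch under several assumptions: that the checkpoint loads through the Hugging Face `transformers` auto classes with `trust_remote_code=True`, that it exposes the same `from_list_format` and `chat` helpers as the Qwen-VL remote code it is built on, that the tokenizer can be taken from `Qwen/Qwen-VL-Chat`, and that the prompt wording below is acceptable. The function names `build_query_items` and `locate` are hypothetical; consult the repository for the supported inference script.

```python
from typing import Dict, List

# Assumed prompt wording; the repository may use a different template.
PROMPT_TEMPLATE = (
    "In this UI screenshot, what is the position of the element "
    'corresponding to the command "{}" (with point)?'
)


def build_query_items(image_path: str, instruction: str) -> List[Dict[str, str]]:
    """Build the image-plus-text item list that Qwen-VL style tokenizers accept."""
    return [{"image": image_path}, {"text": PROMPT_TEMPLATE.format(instruction)}]


def locate(image_path: str, instruction: str, checkpoint: str = "cckevinn/SeeClick") -> str:
    """Ask the model where to click; requires a CUDA-enabled GPU."""
    # Imported lazily so build_query_items works without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the Qwen-VL-Chat tokenizer is compatible with the checkpoint.
    tokenizer = AutoTokenizer.from_pretrained(
        "Qwen/Qwen-VL-Chat", trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint, device_map="cuda", trust_remote_code=True
    ).eval()

    # from_list_format and chat come from the Qwen-VL remote code (assumed here).
    query = tokenizer.from_list_format(build_query_items(image_path, instruction))
    response, _history = model.chat(tokenizer, query=query, history=None)
    return response
```

A call such as `locate("screenshot.png", "add an event")` would return the model's raw text response, which still needs to be parsed into coordinates.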
Highlighted Details
- Introduces ScreenSpot, a GUI grounding benchmark with over 1,200 instructions across iOS, Android, macOS, Windows, and Web.
- Reports state-of-the-art GUI grounding results, outperforming models such as GPT-4V and CogAgent on average across the reported metrics.
- Provides a large-scale GUI grounding pre-training dataset, including a corpus from Common Crawl.
- Offers code for pre-training, fine-tuning (including LoRA), and evaluation on downstream agent tasks.
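A grounding benchmark of this kind can be scored by checking whether each predicted click point lands inside the target element's bounding box. The sketch below assumes that scoring rule and a `(left, top, right, bottom)` box layout in the same coordinate system as the point. Both are illustrative assumptions, and `point_in_box` and `click_accuracy` are hypothetical helpers rather than the project's evaluation code.

```python
from typing import Iterable, Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed layout: (left, top, right, bottom)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(
    predictions: Iterable[Optional[Point]], targets: Iterable[Box]
) -> float:
    """Fraction of samples whose predicted point falls inside the target box.

    A prediction of None (for example, an unparseable response) counts as a miss.
    """
    pairs = list(zip(predictions, targets))
    if not pairs:
        return 0.0
    hits = sum(
        1 for point, box in pairs if point is not None and point_in_box(point, box)
    )
    return hits / len(pairs)
```

Counting failed parses as misses keeps the denominator equal to the number of benchmark instructions, so scores stay comparable between models that fail in different ways.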
Maintenance & Community
- The accompanying paper was published at ACL 2024.
- Further details on downstream agent tasks and fine-tuning are available in the repository.
Licensing & Compatibility
- The project incorporates datasets and checkpoints governed by their original licenses. Users must adhere to all specified terms.
- Compatibility for commercial use or closed-source linking depends on the licenses of the incorporated datasets and checkpoints.
Limitations & Caveats
- No single license covers the whole project: each incorporated dataset and checkpoint carries its own terms and must be reviewed separately.
- SeeClick is trained primarily to predict click points, so its bounding-box predictions may be less reliable.
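When a workflow yields a bounding box but the agent needs a single click target, one common reduction is to click the center of the box. The sketch below assumes a `(left, top, right, bottom)` layout, and `box_center` is a hypothetical helper, not part of the project.

```python
from typing import Tuple


def box_center(box: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Return the center point of a (left, top, right, bottom) box."""
    left, top, right, bottom = box
    # Guard against boxes whose corners were emitted in the wrong order.
    if right < left or bottom < top:
        raise ValueError("box corners are out of order")
    return ((left + right) / 2.0, (top + bottom) / 2.0)
```

For example, `box_center((0.25, 0.5, 0.75, 1.0))` returns `(0.5, 0.75)`.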