Visual GUI agent for grounding and interacting with graphical user interfaces
SeeClick provides the model, data, and code for a visual GUI agent that relies on GUI grounding, i.e., locating on-screen elements from natural-language instructions, to carry out interactions. It targets researchers and developers building multimodal agents that must understand and operate graphical user interfaces across platforms such as mobile, desktop, and web. The project ships a grounding benchmark, pre-training data, and inference code to support development of such agents.
How It Works
SeeClick is built upon the Qwen-VL architecture, a large vision-language model. It enhances Qwen-VL's capabilities by fine-tuning it on a large-scale GUI grounding dataset. This fine-tuning process teaches the model to accurately identify and locate specific UI elements (text or icons/widgets) based on natural language instructions, enabling it to predict click points or bounding boxes within interface screenshots.
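Because SeeClick is built on Qwen-VL, a grounding query can plausibly be issued through the standard Qwen-VL chat interface: one screenshot plus one instruction. The sketch below assumes that interface; the checkpoint name, prompt wording, and output format are illustrative placeholders rather than the project's documented API.

```python
# Minimal grounding-query sketch, assuming a Qwen-VL-style chat interface.
# The checkpoint name, prompt, and output format are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "cckevinn/SeeClick"  # placeholder: point this at the released SeeClick weights
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, device_map="auto", trust_remote_code=True  # device_map="auto" requires accelerate
).eval()

# A multimodal query: one interface screenshot plus a natural-language instruction.
query = tokenizer.from_list_format([
    {"image": "screenshot.png"},
    {"text": 'In this UI screenshot, where should I click to open "Settings"?'},
])
response, _ = model.chat(tokenizer, query=query, history=None)
print(response)  # e.g. a point such as "(0.82, 0.07)" in normalized screen coordinates
```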
Quick Start & Requirements
pip install -r requirements.txt
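With the dependencies installed, a prediction returned by a grounding query (see the sketch above) still has to be mapped onto the actual screenshot before it can drive a click. The helper below is a small sketch of that step; the "(x, y)" output format with coordinates in [0, 1] is an assumption based on the grounding task description, not a documented contract.

```python
# Sketch: converting a SeeClick-style normalized point prediction into pixel
# coordinates on the source screenshot. The "(x, y)" in [0, 1] format is an
# assumption; adapt the parsing to the model's actual responses.
import re

def parse_point(prediction: str) -> tuple[float, float]:
    """Extract the first two numbers in the response as a normalized (x, y) point."""
    x, y = map(float, re.findall(r"\d*\.?\d+", prediction)[:2])
    return x, y

def to_pixels(point: tuple[float, float], width: int, height: int) -> tuple[int, int]:
    """Scale a normalized point to pixel coordinates for a width x height screenshot."""
    return round(point[0] * width), round(point[1] * height)

pred = "(0.82, 0.07)"  # hypothetical model output
print(to_pixels(parse_point(pred), 1920, 1080))  # -> (1574, 76)
```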
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats