Vision-language-action model for GUI agent & computer use (CVPR 2025 paper)
ShowUI is an open-source, end-to-end Vision-Language-Action (VLA) model designed for GUI agents and computer use. It understands and interacts with graphical user interfaces, supporting tasks such as automated computer operation and complex workflow execution. The project targets researchers and developers building AI agents that act on desktop applications and web interfaces.
How It Works
ShowUI employs a VLA architecture that processes visual inputs (screenshots) and textual instructions to produce executable actions (e.g., mouse clicks, text input). It builds on a vision-language backbone fine-tuned on GUI-specific datasets, giving it a nuanced understanding of UI elements and user intent. The model supports iterative refinement for improved grounding accuracy and can be integrated with tools like vLLM for efficient inference.
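As a rough illustration of that loop, the sketch below grounds a single click instruction against a screenshot. It assumes the released checkpoint (showlab/ShowUI-2B here, an assumed model ID) follows the Qwen2-VL interface in Hugging Face transformers; the prompt wording and output handling are likewise assumptions, not the project's documented API.

```python
# Hedged sketch: one grounding step with a ShowUI-style checkpoint, assuming a
# Qwen2-VL-compatible interface in Hugging Face transformers. Model ID, prompt,
# and output parsing are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "showlab/ShowUI-2B"  # assumed checkpoint name

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

screenshot = Image.open("screenshot.png")    # current GUI state
instruction = "Click the 'Search' button."   # user intent

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": instruction},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

# The model emits an action string (e.g., a normalized click coordinate);
# turning it into a real mouse event is the surrounding agent's job.
output_ids = model.generate(**inputs, max_new_tokens=128)
action = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(action)
```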
Quick Start & Requirements
Quick-start paths include the Gradio demo (GRADIO.md), vLLM inference (inference_vllm.ipynb), and computer control via OOTB.
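For the vLLM path, a minimal offline-inference sketch is shown below; it is in the spirit of inference_vllm.ipynb rather than copied from it, and it assumes the checkpoint (showlab/ShowUI-2B, an assumed ID) loads as a Qwen2-VL-style model with a Qwen2-VL chat template.

```python
# Hedged sketch: serving a ShowUI-style checkpoint with vLLM for one request.
# The model ID and the chat/image-placeholder template are assumptions.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="showlab/ShowUI-2B", max_model_len=4096)  # assumed checkpoint name

screenshot = Image.open("screenshot.png")
prompt = (  # Qwen2-VL-style prompt with an image placeholder (assumption)
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Click the 'Search' button.<|im_end|>\n<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": screenshot}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)  # predicted action / target location
```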
Highlighted Details
Maintenance & Community
The project is actively maintained with frequent updates, including support for new models and inference backends. Community activity is visible through GitHub and an active X (Twitter) presence.
Licensing & Compatibility
The project is open-source, though licenses for individual datasets and model weights may vary. Compatibility with commercial use or closed-source linking should be verified against each component's license.
Limitations & Caveats
While the Gradio demo and API calling do not require a GPU, efficient inference and training likely benefit from or require GPU acceleration. The project is under active development, with potential for breaking changes as new features are integrated.