Vision-language-action model for GUI agent & computer use (CVPR 2025 paper)
ShowUI is an open-source, end-to-end Vision-Language-Action (VLA) model designed for GUI agents and computer use. It understands and interacts with graphical user interfaces, supporting tasks such as automated computer operation and complex workflow execution. The project targets researchers and developers building AI agents that act on desktop applications and web interfaces.
How It Works
ShowUI employs a VLA architecture that processes visual inputs (screenshots) and textual instructions to produce executable actions (e.g., mouse clicks, text input). It builds on a vision-language backbone fine-tuned on GUI-specific datasets, giving it a nuanced understanding of UI elements and user intent. The model supports iterative refinement for improved grounding accuracy and can be integrated with tools like vLLM for efficient inference.
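As a rough illustration of that loop, the sketch below grounds a single click instruction against a screenshot. It assumes the released checkpoint (showlab/ShowUI-2B here, an assumed model ID) follows the Qwen2-VL interface in Hugging Face transformers; the prompt wording and output handling are likewise assumptions, not the project's documented API.

```python
# Hedged sketch: one grounding step with a ShowUI-style checkpoint, assuming a
# Qwen2-VL-compatible interface in Hugging Face transformers. Model ID, prompt,
# and output parsing are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "showlab/ShowUI-2B"  # assumed checkpoint name

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

screenshot = Image.open("screenshot.png")    # current GUI state
instruction = "Click the 'Search' button."   # user intent

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": instruction},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

# The model emits an action string (e.g., a normalized click coordinate);
# turning it into a real mouse event is the surrounding agent's job.
output_ids = model.generate(**inputs, max_new_tokens=128)
action = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(action)
```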
Quick Start & Requirements
Quick-start paths include the Gradio demo (GRADIO.md), vLLM inference (inference_vllm.ipynb), and computer control via OOTB.
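For the vLLM path, a minimal offline-inference sketch is shown below; it is in the spirit of inference_vllm.ipynb rather than copied from it, and it assumes the checkpoint (showlab/ShowUI-2B, an assumed ID) loads as a Qwen2-VL-style model with a Qwen2-VL chat template.

```python
# Hedged sketch: serving a ShowUI-style checkpoint with vLLM for one request.
# The model ID and the chat/image-placeholder template are assumptions.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="showlab/ShowUI-2B", max_model_len=4096)  # assumed checkpoint name

screenshot = Image.open("screenshot.png")
prompt = (  # Qwen2-VL-style prompt with an image placeholder (assumption)
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Click the 'Search' button.<|im_end|>\n<|im_start|>assistant\n"
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": screenshot}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)  # predicted action / target location
```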
Highlighted Details
Maintenance & Community
The project is actively maintained with frequent updates, including support for new models and inference backends. Community activity is visible through GitHub and an active X (Twitter) presence.
Licensing & Compatibility
The project is open-source, though licenses for individual datasets and model weights may vary. Compatibility with commercial use or closed-source linking should be verified against each component's license.
Limitations & Caveats
While the Gradio demo and API calling do not require a GPU, efficient inference and training likely benefit from or require GPU acceleration. The project is under active development, with potential for breaking changes as new features are integrated.