ShowUI by showlab

Vision-language-action model for GUI agent & computer use (CVPR 2025 paper)

created 9 months ago
1,398 stars

Top 29.5% on sourcepulse

Project Summary

ShowUI is an open-source, end-to-end Vision-Language-Action (VLA) model designed for GUI agents and computer use. It enables models to understand and interact with graphical user interfaces, facilitating tasks like automated computer operation and complex workflow execution. The project targets researchers and developers building AI agents capable of performing actions on desktop applications and web interfaces.

How It Works

ShowUI uses a VLA architecture that takes a screenshot and a text instruction as input and produces an action as output (e.g., a mouse click or text entry). The underlying model is a large vision-language model fine-tuned on GUI-specific datasets, which helps it recognize UI elements and interpret user intent. The model supports iterative refinement to improve grounding accuracy and can be served with vLLM for efficient inference.
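As a rough illustration of the grounding step and of iterative refinement, the sketch below maps a predicted point to pixel coordinates and then re-queries on a smaller crop around it. This is one common zoom-in strategy and not necessarily how ShowUI implements refinement. The normalized (0 to 1) coordinate convention, the `predict` callable, and the names `to_pixels` and `refine` are all assumptions made for the example.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]


def to_pixels(norm_xy: Point, box: Box) -> Point:
    """Map a normalized (0-1) point inside `box` to absolute pixel coordinates."""
    left, top, right, bottom = box
    x, y = norm_xy
    return (left + x * (right - left), top + y * (bottom - top))


def refine(
    predict: Callable[[Box], Point],
    screen: Tuple[int, int],
    rounds: int = 2,
    zoom: float = 0.5,
) -> Point:
    """Query a grounding model repeatedly on progressively smaller crops.

    `predict` is a hypothetical stand-in for the model: given a crop box,
    it returns the target location normalized to that crop.
    """
    width, height = screen
    box: Box = (0.0, 0.0, float(width), float(height))
    point = to_pixels(predict(box), box)
    for _ in range(rounds - 1):
        # Shrink the crop around the current guess, clamped to the screen.
        crop_w = (box[2] - box[0]) * zoom
        crop_h = (box[3] - box[1]) * zoom
        left = min(max(point[0] - crop_w / 2, 0.0), width - crop_w)
        top = min(max(point[1] - crop_h / 2, 0.0), height - crop_h)
        box = (left, top, left + crop_w, top + crop_h)
        point = to_pixels(predict(box), box)
    return point


def fake_predict(box: Box) -> Point:
    """Toy predictor that always locates a target at pixel (1200, 300)."""
    left, top, right, bottom = box
    return ((1200 - left) / (right - left), (300 - top) / (bottom - top))


click_xy = refine(fake_predict, screen=(1920, 1080))
```

In a real agent, `predict` would crop the screenshot to the given box, run the model on the crop, and return its predicted position; the resulting pixel coordinates would then be handed to whatever tool performs the click.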

Quick Start & Requirements

  • Install/Run: Run the Gradio demo locally (see GRADIO.md), run inference with vLLM (inference_vllm.ipynb), or use OOTB for computer control.
  • Prerequisites: Python; CUDA may be needed for vLLM. The Gradio demo and API calling do not require a GPU.
  • Resources: Supports int8 quantization.
  • Links: Paper, Hugging Face Demo, OOTB, Quick Start.
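To give a sense of why int8 quantization matters for resource planning, the sketch below estimates the memory taken by model weights alone at different precisions. The 2-billion-parameter figure is a hypothetical example, not a number taken from this page, and the estimate ignores activations, caches, and framework overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Hypothetical model size, chosen only for illustration.
params = 2e9
fp16_gb = weight_memory_gb(params, 16)  # 16-bit weights
int8_gb = weight_memory_gb(params, 8)   # 8-bit quantized weights
print(f"fp16: {fp16_gb:.1f} GB, int8: {int8_gb:.1f} GB")
```

Under these assumptions, int8 halves the weight footprint relative to 16-bit precision; actual usage at inference time will be higher.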

Highlighted Details

  • Accepted to CVPR 2025.
  • Received the Outstanding Paper Award at the NeurIPS 2024 Open-World Agents workshop.
  • Supports fine-tuning and inference with Qwen2.5-VL and vLLM.
  • Enables API calling via Gradio Client and local computer control via OOTB integration.

Maintenance & Community

The project is actively maintained, with frequent updates that add support for new models and inference backends. The maintainers keep an active X (Twitter) presence and invite support through GitHub stars.

Licensing & Compatibility

The project is open-source, but the licenses for the datasets and model weights may differ from the license for the code. Check each component's license before using it commercially or linking it from closed-source software.

Limitations & Caveats

The Gradio demo and API calling do not require a GPU, but training and efficient local inference will likely need GPU acceleration. The project is under active development, so breaking changes are possible as new features are integrated.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 188 stars in the last 90 days
