Aria-UI by AriaUI

GUI agent for context-aware action grounding from instructions

created 7 months ago
373 stars

Top 77.1% on sourcepulse

Project Summary

Aria-UI is an open-source project providing fast, context-aware action grounding for GUI/computer-use agents. It translates natural language instructions into precise pixel coordinates on graphical user interfaces, enabling agents to interact with software. The project targets developers building autonomous agents for tasks like UI automation, testing, and assistive technologies.

How It Works

Aria-UI employs a mixture-of-experts (MoE) architecture with 3.9B activated parameters per token. It processes variable-sized GUI inputs, including interleaved text and images, to ground instructions in context: visual information is encoded efficiently, and historical context (prior actions, as text or text-image interleaved) improves grounding accuracy, leading to state-of-the-art performance on agent benchmarks.
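
To illustrate the grounding contract (instruction + screenshot in, pixel coordinates out), here is a minimal sketch of parsing a model response into a click target. The "[x, y]" output format is an assumption based on common GUI-grounding conventions; verify the exact format against the Aria-UI README.

```python
# Illustrative sketch: turn a grounding response into an actionable point.
# The "[x, y]" response format is an assumption, not the documented output.
import ast

def parse_ground_point(response: str) -> tuple[int, int]:
    """Parse a model response such as '[1231, 445]' into (x, y) pixels."""
    cleaned = response.strip().strip("[]")
    x, y = ast.literal_eval(cleaned)  # "1231, 445" -> (1231, 445)
    return int(x), int(y)

def to_normalized(point: tuple[int, int], width: int, height: int) -> tuple[float, float]:
    """Convert absolute pixels to [0, 1] coordinates for resolution-independent replay."""
    return point[0] / width, point[1] / height

print(parse_ground_point("[1231, 445]"))       # (1231, 445)
print(to_normalized((1231, 445), 2560, 1440))  # (0.4808..., 0.3090...)
```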

Quick Start & Requirements

  • Install via pip: pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow
  • For enhanced performance, install flash-attn and optionally grouped_gemm.
  • vLLM inference is strongly recommended: pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl (with VLLM_COMMIT set).
  • Requires PyTorch and a CUDA-enabled GPU for best performance; a minimal inference sketch follows this list.
  • Official Demo: https://huggingface.co/spaces/Aria-UI/Aria-UI
  • Project Page: https://ariaui.github.io
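
A minimal Transformers inference sketch, assuming the Hugging Face model ID Aria-UI/Aria-UI-base and an Aria-style chat template exposed by the model's custom (trust_remote_code) processor; confirm both against the README before relying on them.

```python
# Minimal sketch, not the authoritative recipe. Assumptions: model ID
# "Aria-UI/Aria-UI-base"; the custom processor supplies apply_chat_template.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Aria-UI/Aria-UI-base"  # assumed Hugging Face model ID
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("screenshot.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Click the search button"},
    ],
}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)  # match bfloat16
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=False)

response = processor.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)  # expected: pixel coordinates for the target element
```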

Highlighted Details

  • Achieves 1st place on AndroidWorld (44.8% task success rate) and 3rd place on OSWorld (15.2% task success rate) as of Dec 2024.
  • Supports context-aware grounding by leveraging historical input, either text-only or text-image interleaved (see the message-construction sketch after this list).
  • Handles ultra-resolution GUI inputs with variable sizes and aspect ratios.
  • Released context-aware models and datasets with ~992K instruction-output pairs.
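
To illustrate the context-aware mode named above, the sketch below builds a message list that prefixes prior actions to the current screenshot and instruction. The schema mirrors the common Transformers chat-message convention; the exact history format Aria-UI expects is an assumption and should be taken from the released examples.

```python
# Hypothetical message construction for context-aware grounding. How Aria-UI
# actually encodes action history is an assumption, not the documented API.
def build_messages(history: list[str], instruction: str) -> list[dict]:
    """Interleave prior actions (text) with the current screenshot + instruction."""
    context = "\n".join(f"Step {i + 1}: {step}" for i, step in enumerate(history))
    content = []
    if history:
        content.append({"type": "text", "text": f"Previous actions:\n{context}"})
    content.append({"type": "image"})  # current screenshot goes to the processor
    content.append({"type": "text", "text": instruction})
    return [{"role": "user", "content": content}]

messages = build_messages(
    history=["Opened the Settings app", "Scrolled to the Network section"],
    instruction="Tap the Wi-Fi toggle",
)
```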

Maintenance & Community

  • Active development with recent releases in Jan-Feb 2025.
  • Models and datasets available on Hugging Face and ModelScope.
  • Paper available on arXiv: https://arxiv.org/abs/2412.16256

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • The README does not specify licensing, which may impact commercial use or closed-source integration.
  • Inference with the base Transformers library is noted as "not recommended" compared to vLLM.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 17 stars in the last 90 days

Explore Similar Projects

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), and 1 more.

CogVLM by zai-org

Top 0.1% on sourcepulse · 7k stars
VLM for image understanding and multi-turn dialogue
created 1 year ago · updated 1 year ago