Aria-UI by AriaUI

GUI agent for context-aware action grounding from instructions

created 7 months ago
373 stars

Top 77.1% on sourcepulse

Project Summary

Aria-UI is an open-source project providing fast, context-aware action grounding for GUI/computer-use agents. It translates natural language instructions into precise pixel coordinates on graphical user interfaces, enabling agents to interact with software. The project targets developers building autonomous agents for tasks like UI automation, testing, and assistive technologies.

How It Works

Aria-UI employs a mixture-of-experts (MoE) architecture with 3.9B activated parameters per token. It processes variable-sized GUI inputs, including interleaved text and images, to ground instructions in context: visual information is encoded efficiently, and historical context (prior actions, as text or text-image interleaved) improves grounding accuracy, leading to state-of-the-art performance on agent benchmarks.
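
To illustrate the grounding contract (instruction + screenshot in, pixel coordinates out), here is a minimal sketch of parsing a model response into a click target. The "[x, y]" output format is an assumption based on common GUI-grounding conventions; verify the exact format against the Aria-UI README.

```python
# Illustrative sketch: turn a grounding response into an actionable point.
# The "[x, y]" response format is an assumption, not the documented output.
import ast

def parse_ground_point(response: str) -> tuple[int, int]:
    """Parse a model response such as '[1231, 445]' into (x, y) pixels."""
    cleaned = response.strip().strip("[]")
    x, y = ast.literal_eval(cleaned)  # "1231, 445" -> (1231, 445)
    return int(x), int(y)

def to_normalized(point: tuple[int, int], width: int, height: int) -> tuple[float, float]:
    """Convert absolute pixels to [0, 1] coordinates for resolution-independent replay."""
    return point[0] / width, point[1] / height

print(parse_ground_point("[1231, 445]"))       # (1231, 445)
print(to_normalized((1231, 445), 2560, 1440))  # (0.4808..., 0.3090...)
```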

Quick Start & Requirements

  • Install via pip: pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow
  • For enhanced performance, install flash-attn and optionally grouped_gemm.
  • vLLM inference is strongly recommended: pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl (with VLLM_COMMIT set).
  • Requires PyTorch and a CUDA-enabled GPU for best performance; a minimal inference sketch follows this list.
  • Official Demo: https://huggingface.co/spaces/Aria-UI/Aria-UI
  • Project Page: https://ariaui.github.io
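
A minimal Transformers inference sketch, assuming the Hugging Face model ID Aria-UI/Aria-UI-base and an Aria-style chat template exposed by the model's custom (trust_remote_code) processor; confirm both against the README before relying on them.

```python
# Minimal sketch, not the authoritative recipe. Assumptions: model ID
# "Aria-UI/Aria-UI-base"; the custom processor supplies apply_chat_template.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Aria-UI/Aria-UI-base"  # assumed Hugging Face model ID
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("screenshot.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Click the search button"},
    ],
}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)  # match bfloat16
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=False)

response = processor.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)  # expected: pixel coordinates for the target element
```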

Highlighted Details

  • Achieves 1st place on AndroidWorld (44.8% task success rate) and 3rd place on OSWorld (15.2% task success rate) as of Dec 2024.
  • Supports context-aware grounding by leveraging historical input, either text-only or text-image interleaved (see the message-construction sketch after this list).
  • Handles ultra-resolution GUI inputs with variable sizes and aspect ratios.
  • Released context-aware models and datasets with ~992K instruction-output pairs.
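
To illustrate the context-aware mode named above, the sketch below builds a message list that prefixes prior actions to the current screenshot and instruction. The schema mirrors the common Transformers chat-message convention; the exact history format Aria-UI expects is an assumption and should be taken from the released examples.

```python
# Hypothetical message construction for context-aware grounding. How Aria-UI
# actually encodes action history is an assumption, not the documented API.
def build_messages(history: list[str], instruction: str) -> list[dict]:
    """Interleave prior actions (text) with the current screenshot + instruction."""
    context = "\n".join(f"Step {i + 1}: {step}" for i, step in enumerate(history))
    content = []
    if history:
        content.append({"type": "text", "text": f"Previous actions:\n{context}"})
    content.append({"type": "image"})  # current screenshot goes to the processor
    content.append({"type": "text", "text": instruction})
    return [{"role": "user", "content": content}]

messages = build_messages(
    history=["Opened the Settings app", "Scrolled to the Network section"],
    instruction="Tap the Wi-Fi toggle",
)
```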

Maintenance & Community

  • Active development with recent releases in Jan-Feb 2025.
  • Models and datasets available on Hugging Face and ModelScope.
  • Paper available on arXiv: https://arxiv.org/abs/2412.16256

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • The README does not specify licensing, which may impact commercial use or closed-source integration.
  • Inference with the base Transformers library is noted as "not recommended" compared to vLLM.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 17 stars in the last 90 days

Explore Similar Projects

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), and 1 more.

CogVLM by zai-org

Top 0.1% on sourcepulse · 7k stars
VLM for image understanding and multi-turn dialogue
created 1 year ago · updated 1 year ago