page-eyes-agent  by tencentmusic

Lightweight UI automation agent driven by natural language

Created 3 months ago
643 stars

Top 51.8% on SourcePulse

Project Summary

PageEyes Agent is a lightweight UI automation agent that executes Web and Android UI tasks from natural language commands, with no manual scripting. It targets engineers and testers who want script-free automation for tasks such as testing and UI inspection across platforms. Its main benefit is that complex UI interactions can be driven by simple text instructions, which streamlines the automation workflow.

How It Works

PageEyes Agent is built on the Pydantic AI framework and uses the OmniParserV2 model to perceive on-screen element information. A key design advantage is that it does not depend on large visual language models, so a smaller LLM is sufficient for path planning. This improves flexibility and reduces computational overhead. The agent runs cross-platform on Web and Android, with iOS support planned. It integrates with various LLMs, defaulting to DeepSeek V3, and supports natural language assertions with detailed execution logging and reporting.
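
The split between element perception and path planning described above can be illustrated with a toy sketch. This is not the project's code: the `Element` and `Action` shapes and the `plan_step` function are hypothetical, the element list is hard-coded instead of coming from OmniParserV2, and a simple keyword match stands in for the LLM. It only shows why a text-only planner can work once elements are already extracted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """One UI element as a parser might report it (hypothetical shape)."""
    element_id: int
    label: str


@dataclass
class Action:
    """A planned step that targets one element (hypothetical shape)."""
    kind: str
    element_id: int


def plan_step(instruction: str, elements: List[Element]) -> Optional[Action]:
    """Stand-in for the LLM planner.

    Picks the first element whose label appears in the instruction.
    The planner only ever sees text, never pixels.
    """
    text = instruction.lower()
    for element in elements:
        if element.label.lower() in text:
            return Action("click", element.element_id)
    return None


# Stand-in for the perception step: a fixed list instead of a parsed screenshot.
elements = [Element(0, "Search"), Element(1, "Login"), Element(2, "Settings")]

action = plan_step("Click the login button", elements)
print(action)  # Action(kind='click', element_id=1)
```

In a real agent the keyword match would be replaced by an LLM call that receives the instruction and the element list as text, which is the part that lets a smaller, non-visual model handle planning.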

Quick Start & Requirements

  • Installation: Install via pip: pip install page-eyes. Alternatively, clone the repository and run uv sync to install dependencies.
  • Prerequisites: A Python environment and a configured OmniParser service. Environment variables select the LLM model (AGENT_MODEL) and supply the API key (OPENAI_API_KEY) and API endpoint (OPENAI_BASE_URL); optional variables control debug mode (AGENT_DEBUG) and headless operation (AGENT_HEADLESS). Logging and reporting require cloud storage credentials for Tencent Cloud COS or MinIO (COS_SECRET_ID, MINIO_ENDPOINT, etc.).
  • Links: Usage examples are provided within the documentation.
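
Before a first run, it can help to confirm that the environment variables named above are set. The following is a minimal sketch that uses only the Python standard library. The helper names `missing_required` and `optional_settings` are hypothetical and are not part of page-eyes. Treating the three LLM variables as strictly required is an assumption based on this summary, since the project may supply defaults. Storage credentials are left out because the full list is not given here.

```python
import os

# Variable names taken from the prerequisites listed above.
REQUIRED_VARS = ("AGENT_MODEL", "OPENAI_API_KEY", "OPENAI_BASE_URL")
OPTIONAL_VARS = ("AGENT_DEBUG", "AGENT_HEADLESS")


def missing_required(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def optional_settings(env=None):
    """Return the optional variables that are present, with their values."""
    env = os.environ if env is None else env
    return {name: env[name] for name in OPTIONAL_VARS if name in env}


if __name__ == "__main__":
    missing = missing_required()
    if missing:
        print("Missing required settings: " + ", ".join(missing))
    else:
        print("All required LLM settings are present.")
    print("Optional settings found:", optional_settings())
```

Both helpers accept a plain dictionary in place of `os.environ`, which makes the check easy to exercise without touching the real environment.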

Highlighted Details

  • Fully driven by natural language instructions, requiring no manual scripting.
  • Supports cross-platform automation for Web and Android, with iOS planned.
  • Integrates with multiple LLM providers, including DeepSeek and OpenAI.
  • Facilitates natural language assertions and generates comprehensive execution logs and reports.

Maintenance & Community

The project outlines a contribution guide and encourages users to submit issues for feature ideas or bug reports. A community group is mentioned for further discussion. Specific details regarding maintainers, sponsorships, or a public roadmap beyond planned platform support are not detailed in the provided text.

Licensing & Compatibility

The provided documentation does not explicitly state the software license. This omission requires clarification regarding usage rights, particularly for commercial applications or integration into closed-source projects.

Limitations & Caveats

The agent currently supports only Web and Android; iOS support is pending. It requires configuring external services (OmniParser and cloud storage) and valid API keys for the chosen LLM, which may incur costs. The absence of a specified license is a significant adoption blocker.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 235 stars in the last 30 days
