page-eyes-agent by tencentmusic

Lightweight UI automation agent driven by natural language

Created 5 months ago

881 stars

Top 40.6% on SourcePulse

Project Summary

PageEyes Agent is a lightweight UI automation agent that carries out Web and Android UI tasks from natural language commands, so no scripts have to be written by hand. It targets engineers and testers who want script-free automation for tasks such as testing and UI inspection across platforms. Its main benefit is that complex UI interactions can be expressed as plain text instructions.

How It Works

PageEyes Agent is built on the Pydantic AI framework and uses the OmniParserV2 model to perceive on-screen element information. A key design advantage is that it does not depend on a large visual language model: a smaller LLM is enough for path planning, which adds flexibility and reduces computational overhead. The agent runs on Web and Android, with iOS support planned. It works with several LLMs, defaulting to DeepSeek V3, and supports natural language assertions with detailed execution logging and reporting.
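The split between element perception and text-only planning can be illustrated with a minimal Python sketch. It is not the page-eyes API: Element, Action, describe, keyword_planner, and run_step are hypothetical names, the element list stands in for what a parser such as OmniParserV2 might report, and a trivial keyword matcher stands in for the LLM planner.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    """Hypothetical parsed UI element: an id, a text label, and screen coordinates."""
    element_id: int
    label: str
    x: int
    y: int


@dataclass
class Action:
    """Hypothetical planner output: what to do and on which element."""
    kind: str
    element_id: int


def describe(elements: List[Element]) -> str:
    """Render the parsed elements as plain text, so the planner never sees pixels."""
    return "\n".join(f"[{e.element_id}] {e.label}" for e in elements)


def keyword_planner(instruction: str, screen_text: str) -> Action:
    """Stand-in for a text-only LLM: choose the element whose label shares
    the most words with the instruction."""
    words = set(instruction.lower().split())
    best_id, best_score = -1, 0
    for line in screen_text.splitlines():
        id_part, label = line.split("] ", 1)
        score = len(words & set(label.lower().split()))
        if score > best_score:
            best_id, best_score = int(id_part[1:]), score
    if best_id < 0:
        raise ValueError("no element matches the instruction")
    return Action(kind="click", element_id=best_id)


def run_step(
    instruction: str,
    elements: List[Element],
    planner: Callable[[str, str], Action] = keyword_planner,
) -> str:
    """One perceive-plan-act step; returns a description instead of driving a real UI."""
    action = planner(instruction, describe(elements))
    target = next(e for e in elements if e.element_id == action.element_id)
    return f"{action.kind} at ({target.x}, {target.y})"


if __name__ == "__main__":
    screen = [Element(0, "Search box", 100, 20), Element(1, "Login button", 300, 20)]
    print(run_step("click the login button", screen))
```

Because the planner receives only the text produced by describe, any text-only model could be swapped in through the planner parameter, which is the property the paragraph above attributes to the design.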

Quick Start & Requirements

Installation: Run pip install page-eyes. Alternatively, clone the repository and run uv sync to install dependencies.
Prerequisites: A Python environment and a configured OmniParser service. Environment variables select the LLM model (AGENT_MODEL) and supply the API key (OPENAI_API_KEY) and API endpoint (OPENAI_BASE_URL); optional variables include debug mode (AGENT_DEBUG) and headless operation (AGENT_HEADLESS). Credentials for cloud storage, either Tencent Cloud COS or MinIO (COS_SECRET_ID, MINIO_ENDPOINT, etc.), are needed for logging and reporting.
Links: Usage examples are provided within the documentation.
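
Before launching the agent, it can help to check that the environment variables listed above are present. The following is a minimal sketch using only the Python standard library. The variable names come from the prerequisites; treating the first three as required and the helper name missing_settings are assumptions, not part of the project.

```python
import os
from typing import List, Mapping, Optional

# Names taken from the prerequisites above. Treating these three as mandatory
# is an assumption based on how the summary words them.
REQUIRED_VARS = ("AGENT_MODEL", "OPENAI_API_KEY", "OPENAI_BASE_URL")
# Described as optional in the summary.
OPTIONAL_VARS = ("AGENT_DEBUG", "AGENT_HEADLESS")


def missing_settings(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing required settings: " + ", ".join(missing))
    else:
        print("All required settings are present.")
    for name in OPTIONAL_VARS:
        # Report only whether optional settings are set, never their values.
        print(f"{name}: {'set' if os.environ.get(name) else 'not set'}")
```

The check reports names only and never prints values, so API keys do not end up in logs. The cloud storage credentials are left out because the summary names only some of them.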

Highlighted Details

Fully driven by natural language instructions, requiring no manual scripting.
Supports cross-platform automation for Web and Android, with iOS planned.
Integrates with multiple LLM providers, including DeepSeek and OpenAI.
Facilitates natural language assertions and generates comprehensive execution logs and reports.

Maintenance & Community

The project provides a contribution guide and encourages users to submit issues for feature ideas or bug reports. A community group is mentioned for further discussion. Maintainers, sponsorships, and a public roadmap beyond the planned platform support are not specified in the provided text.

Licensing & Compatibility

The provided documentation does not explicitly state the software license. This omission requires clarification regarding usage rights, particularly for commercial applications or integration into closed-source projects.

Limitations & Caveats

The agent currently supports Web and Android, with iOS support pending. It requires configuring external services such as OmniParser and cloud storage, plus valid API keys for the chosen LLM, which may incur costs. The absence of a specified license is a significant adoption blocker.
