LLM-based multimodal agent for smartphone app operation
AppAgent is an LLM-based multimodal agent framework that enables AI to operate smartphone applications by mimicking human touch and swipe interactions. It targets researchers and developers looking to build sophisticated mobile automation tools, offering a novel approach that bypasses the need for direct app backend access.
How It Works
AppAgent operates in two phases: exploration and deployment. During exploration, the agent either autonomously navigates an app or learns from human demonstrations, building a knowledge base that documents UI elements and their functions. In the deployment phase, the agent consults this knowledge base to execute user-defined tasks on the smartphone. Because the agent interacts with apps purely through the GUI, the approach generalizes across apps and tasks without explicit API integrations.
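In the released repository, the two phases map to separate entry scripts; the script names below (learn.py, run.py) are taken from the project README and may change between versions:

# Exploration phase: document the target app, autonomously or from a human demonstration
python learn.py
# Deployment phase: execute a natural-language task using the learned documentation
python run.py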
Quick Start & Requirements
Install the Python dependencies:
pip install -r requirements.txt
You will also need adb installed, an Android device or emulator connected with USB debugging enabled, and an API key for a multimodal model (GPT-4V recommended).
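A minimal end-to-end setup might look like the following sketch; the repository URL and adb requirement follow the project README, while the config.yaml step is an assumption to verify against the current release:

git clone https://github.com/TencentQQGYLab/AppAgent.git
cd AppAgent
pip install -r requirements.txt
# add your model API key (e.g., an OpenAI key for GPT-4V) to config.yaml
adb devices    # the connected device should be listed before running the agent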
Maintenance & Community
The project is actively maintained by TencentQQGYLab; recent updates include AppAgentX and support for alternative multimodal models such as Qwen-VL-Max. Support is available via GitHub Issues or email.
Limitations & Caveats
Performance depends on the chosen multimodal model; GPT-4V is recommended over Qwen-VL-Max. The UI documentation generated during exploration may need manual revision for best results.