AppCopilot by OpenBMB

Autonomous multimodal mobile agent system

Created 2 months ago
501 stars

Top 62.0% on SourcePulse

Project Summary

AppCopilot addresses fundamental challenges in mobile agents: generalization across tasks, accuracy in on-screen interaction, long-horizon task completion, and efficiency on resource-constrained devices. It offers a general-purpose, on-device multimodal assistant that operates across applications and devices, aimed at researchers and developers building sophisticated digital assistants.

How It Works

This system employs a multimodal, multi-agent architecture integrating foundation models with robust Chinese-English support. It leverages chain-of-thought reasoning, hierarchical task planning, and multi-agent collaboration for complex goal execution. The closed-loop system spans data collection, training, deployment, and efficient inference, with profiling-driven optimization for latency, memory, and energy across heterogeneous hardware.

Quick Start & Requirements

  • Installation: Clone the repository. Install Android Studio from its official website. Set up Python (3.12 recommended) and Conda (download from anaconda.org/anaconda/conda). Install vLLM (0.9.1) via pip. Configure the ADB and emulator environment variables. Install YADB by cloning it into ./YADB. (See the first sketch after this list.)
  • Server Setup: Deploy the pre-trained models (a GUI model and Qwen-VL-7B) with vLLM (documentation at docs.vllm.ai/en/latest/) on a server, exposing APIs on ports 8001 and 8002 (second sketch below).
  • Local Setup: Forward the server ports (8001, 8002) to the local machine over SSH, install the local Python dependencies (pip install -r requirements.txt), and configure API keys and endpoints in ./wrappers/constants.py (third sketch below).
  • Execution: Run run_agent.py (single device) or cross_device_agent.py (multi-device).
  • Prerequisites: An OS that supports Android Studio, Android Studio itself, Python 3.12, Conda, vLLM, YADB, ADB, an emulator, and typically a GPU for server-side inference.
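
A minimal installation sketch covering the steps above, assuming a Conda-based workflow; the repository URLs, environment-variable names, and SDK paths are illustrative assumptions, not the project's documented values:

    # Hypothetical setup sketch; adjust paths and names to your machine.
    git clone <AppCopilot repository URL> && cd AppCopilot

    # Python 3.12 environment via Conda
    conda create -n appcopilot python=3.12 -y
    conda activate appcopilot

    # vLLM 0.9.1 for server-side inference
    pip install vllm==0.9.1

    # Make adb and the emulator visible on PATH (paths are illustrative)
    export ANDROID_HOME="$HOME/Android/Sdk"
    export PATH="$ANDROID_HOME/platform-tools:$ANDROID_HOME/emulator:$PATH"

    # YADB goes into ./YADB per the instructions above
    git clone <YADB repository URL> ./YADB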
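A hedged server-side serving sketch; the model identifiers are placeholders for whatever checkpoints the project ships, while the flags shown are standard vLLM options:

    # Serve the GUI model and Qwen-VL-7B as OpenAI-compatible APIs.
    # <gui-model> and <qwen-vl-7b> are placeholders, not the repo's paths.
    vllm serve <gui-model> --host 0.0.0.0 --port 8001 &
    vllm serve <qwen-vl-7b> --host 0.0.0.0 --port 8002 &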
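A sketch of the local-side setup and launch, assuming the SSH port-forwarding approach described above; <user> and <server> are placeholders for your GPU host's credentials:

    # Forward both model APIs from the GPU server to localhost
    ssh -N -L 8001:localhost:8001 -L 8002:localhost:8002 <user>@<server> &

    # Local Python dependencies
    pip install -r requirements.txt

    # After filling in API keys and endpoints (e.g. http://localhost:8001/v1)
    # in ./wrappers/constants.py, launch the agent:
    python run_agent.py            # single device
    python cross_device_agent.py   # multi-device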

Highlighted Details

  • Multi-Device Orchestration: Demonstrates sophisticated coordination across two and three devices for tasks like gift purchasing based on viewing history, involving cross-app decision-making and preference extraction.
  • Long-Horizon Task Completion: Successfully executes complex, multi-step tasks such as sequential filtering and sorting in mobile applications.
  • Distributed Intelligence: Enables multi-user collaborative operations, moving towards a system-level architecture with distributed state modeling and autonomous agent collaboration despite information gaps.
  • Content Understanding: Infers user intent and demand from raw data (e.g., video keywords) to guide product search and selection.

Maintenance & Community

The primary information source is the arXiv preprint arXiv:2509.02444. Contact email: qianc@sjtu.edu.cn. No community channels (Discord/Slack) are listed.

Licensing & Compatibility

The README does not specify a software license. Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

The setup requires a dedicated server environment for vLLM model serving plus nontrivial network configuration (SSH port forwarding), which may pose an adoption barrier. The project is presented as a research artifact focused on demonstrating capabilities rather than as a production-ready, easily deployable library.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 326 stars in the last 30 days
