Computer control agent driven by visual language model (research paper)
ScreenAgent enables Visual Language Models (VLMs) to control computer GUIs by interpreting screenshots and generating mouse/keyboard actions. It targets researchers and developers building autonomous agents for desktop automation, offering a framework for planning, execution, and reflection to complete multi-step tasks.
How It Works
ScreenAgent employs a "planning-execution-reflection" loop. The VLM breaks tasks down into subtasks, observes screenshots, and outputs precise screen coordinates for GUI interactions. A controller executes these actions via VNC and feeds the results back to the VLM for reflection and iterative refinement. Because it relies on fundamental GUI operations rather than application-specific APIs, the approach generalizes across operating systems and applications.
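As a rough illustration of the loop, the sketch below uses hypothetical `vlm` and `controller` interfaces (the method names `plan`, `act`, `reflect`, `capture`, and `execute` are assumptions, not the project's actual classes):

```python
# Hypothetical sketch of the planning-execution-reflection loop; `vlm` and
# `controller` are assumed interfaces, not the project's actual code.
def run_task(vlm, controller, task, max_steps=20):
    subtasks = vlm.plan(task)                      # planning: break the task down
    for subtask in subtasks:
        for _ in range(max_steps):
            screenshot = controller.capture()      # observe the current screen
            action = vlm.act(subtask, screenshot)  # e.g. a click at (x, y) or typed text
            controller.execute(action)             # execution: send the action over VNC
            result = controller.capture()
            if vlm.reflect(subtask, result) == "success":   # reflection
                break                              # subtask done, move on
            # otherwise the VLM retries, using the new screenshot as feedback
```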
Quick Start & Requirements
pip install -r client/requirements.txt
Docker image: niuniushan/screenagent-env (provides the controlled desktop environment).
Python 3.10+ is recommended. A VLM (GPT-4V, LLaVA-1.5, CogAgent, or ScreenAgent) is required, with setup instructions provided for local models. Clipboard service setup is recommended for non-ASCII input.
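To give a sense of what action execution over VNC looks like, here is a minimal sketch using the third-party vncdotool library; the host, port, and password are placeholders, and the project's own controller may handle the protocol differently.

```python
# Minimal sketch of driving a desktop over VNC with the third-party vncdotool
# library; host, port, and password are placeholders.
from vncdotool import api

client = api.connect("localhost::5900", password="screenagent")

client.captureScreen("before.png")   # screenshot for the VLM to interpret
client.mouseMove(400, 300)           # move to coordinates chosen by the VLM
client.mousePress(1)                 # left-click at that position
client.keyPress("enter")            # press a key, e.g. to confirm a dialog
client.captureScreen("after.png")    # capture the result for reflection

client.disconnect()
```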
Maintenance & Community
The project was accepted to IJCAI 2024. The README does not list specific community channels or active maintainers.
Licensing & Compatibility
Limitations & Caveats
The setup process for local VLM inferencers is complex and resource-intensive. The project is research-oriented, and the "TODO" list indicates areas for future development, such as simplifying the controller and integrating with Gym.