ScreenAgent by niuzaisheng

Computer control agent driven by visual language model (research paper)

created 1 year ago
481 stars

Top 64.6% on sourcepulse

Project Summary

ScreenAgent enables Visual Language Models (VLMs) to control computer GUIs by interpreting screenshots and generating mouse/keyboard actions. It targets researchers and developers building autonomous agents for desktop automation, offering a framework for planning, execution, and reflection to complete multi-step tasks.

How It Works

ScreenAgent employs a "planning-execution-reflection" loop, sketched below. The VLM breaks a task down into subtasks, observes screenshots, and outputs precise screen coordinates for GUI interactions. A controller executes these actions over VNC and feeds the results back to the VLM for reflection and iterative refinement. Because the agent relies on fundamental GUI operations rather than application-specific APIs, the approach generalizes across operating systems and applications.
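
The sketch below illustrates that control loop in plain Python. All names here (plan_subtasks, query_vlm_for_action, reflect, and so on) are illustrative stubs standing in for ScreenAgent's actual components, not the project's real API.

    # Hypothetical sketch of a planning-execution-reflection loop.
    # Every helper below is an illustrative stub, not ScreenAgent's real interface.
    from typing import List

    def plan_subtasks(task: str) -> List[str]:
        # In ScreenAgent the VLM decomposes the task from the current screenshot;
        # here we just return a canned plan.
        return [f"step 1 of {task!r}", f"step 2 of {task!r}"]

    def capture_screenshot() -> bytes:
        # The real controller fetches the framebuffer over VNC.
        return b"<png bytes>"

    def query_vlm_for_action(subtask: str, screenshot: bytes) -> dict:
        # The real agent prompts the VLM with the screenshot and parses an action
        # (mouse coordinates, key presses) from its reply.
        return {"type": "click", "x": 100, "y": 200}

    def execute_action(action: dict) -> None:
        # The real controller translates the action into VNC mouse/keyboard events.
        print("executing", action)

    def reflect(subtask: str, screenshot: bytes) -> str:
        # The real agent asks the VLM whether the subtask succeeded.
        return "success"

    def run_task(task: str, max_steps_per_subtask: int = 10) -> None:
        for subtask in plan_subtasks(task):
            for _ in range(max_steps_per_subtask):
                action = query_vlm_for_action(subtask, capture_screenshot())
                execute_action(action)
                if reflect(subtask, capture_screenshot()) == "success":
                    break

    if __name__ == "__main__":
        run_task("open the text editor and type hello")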

Quick Start & Requirements

  • Install/Run: Clone the repository and install dependencies (pip install -r client/requirements.txt).
  • Prerequisites: A VNC server (e.g., TightVNC) on the target desktop, or the provided Docker image (niuniushan/screenagent-env); a minimal connection check is sketched after this list. Python 3.10+ is recommended. A VLM (GPT-4V, LLaVA-1.5, CogAgent, or ScreenAgent) is required, with setup instructions provided for local models. Setting up the clipboard service is recommended for non-ASCII input.
  • Resources: Running local VLM inferencers (LLaVA, CogAgent) requires significant GPU resources.
  • Docs: ScreenAgent Paper, Web Client
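
As a quick sanity check of the VNC prerequisite, the snippet below connects to a VNC server from Python and issues a few basic events. It uses the third-party vncdotool library, not ScreenAgent's own controller, and the host, port, and password are placeholders.

    # Hypothetical VNC smoke test using the third-party vncdotool library
    # (pip install vncdotool). ScreenAgent ships its own controller; this only
    # verifies that the VNC server in the prerequisites is reachable.
    from vncdotool import api

    HOST = "127.0.0.1::5900"        # host::port of the TightVNC / Docker VNC server
    PASSWORD = "your_vnc_password"  # placeholder

    client = api.connect(HOST, password=PASSWORD)
    client.captureScreen("screenshot.png")  # grab the desktop the agent would see
    client.mouseMove(100, 200)              # move the pointer to (100, 200)
    client.mousePress(1)                    # left-click
    client.keyPress("enter")                # press Enter
    client.disconnect()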

Highlighted Details

  • Supports multiple VLM backends including GPT-4V, LLaVA-1.5, CogAgent, and the project's own ScreenAgent model.
  • Utilizes a custom dataset (ScreenAgent Dataset) for training and evaluation, alongside existing datasets like Rico and Mind2Web.
  • Implements a VNC-based action space for broad compatibility across desktop environments (a toy sketch of such an action space follows this list).
  • Features a "planning-execution-reflection" control loop for robust task completion.
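
To illustrate why such an action space travels well, the toy sketch below models mouse and keyboard actions as plain data that any VNC client can realize as pointer and key events. It is not ScreenAgent's actual action schema.

    # Hypothetical sketch of a minimal GUI action space. ScreenAgent defines its
    # own schema; these dataclasses only show why screenshots plus absolute
    # coordinates generalize across desktops.
    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class MouseAction:
        x: int            # absolute pixel coordinates on the remote screen
        y: int
        button: int = 1   # 1 = left, 2 = middle, 3 = right

    @dataclass
    class KeyAction:
        key: str          # e.g. "enter", "ctrl-c", or a single character

    Action = Union[MouseAction, KeyAction]

    def describe(action: Action) -> str:
        # Any VNC client can turn these into pointer/key events, so the same
        # action space works on Linux, Windows, or macOS desktops.
        if isinstance(action, MouseAction):
            return f"click button {action.button} at ({action.x}, {action.y})"
        return f"press {action.key}"

    print(describe(MouseAction(x=640, y=360)))
    print(describe(KeyAction(key="enter")))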

Maintenance & Community

The accompanying paper was accepted to IJCAI 2024. The README does not list dedicated community channels or active maintainers.

Licensing & Compatibility

  • License: MIT for the code, Apache-2.0 for the dataset.
  • Model License: The CogVLM License applies to the CogAgent and ScreenAgent models.
  • Compatibility: The MIT license permits commercial use and linking with closed-source projects.

Limitations & Caveats

The setup process for local VLM inferencers is complex and resource-intensive. The project is research-oriented, and the "TODO" list indicates areas for future development, such as simplifying the controller and integrating with Gym.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 40 stars in the last 90 days
