Computer control agent driven by visual language model (research paper)
ScreenAgent enables Visual Language Models (VLMs) to control computer GUIs by interpreting screenshots and generating mouse/keyboard actions. It targets researchers and developers building autonomous agents for desktop automation, offering a framework for planning, execution, and reflection to complete multi-step tasks.
How It Works
ScreenAgent employs a "planning-execution-reflection" loop. The VLM breaks tasks down into subtasks, observes screenshots, and outputs precise screen coordinates for GUI interactions. A controller executes these actions via VNC and feeds the results back to the VLM for reflection and iterative refinement. Because it relies on fundamental GUI operations rather than application-specific APIs, the approach generalizes across operating systems and applications.
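As a rough illustration of the loop, the sketch below uses hypothetical `vlm` and `controller` interfaces (the method names `plan`, `act`, `reflect`, `capture`, and `execute` are assumptions, not the project's actual classes):

```python
# Hypothetical sketch of the planning-execution-reflection loop; `vlm` and
# `controller` are assumed interfaces, not the project's actual code.
def run_task(vlm, controller, task, max_steps=20):
    subtasks = vlm.plan(task)                      # planning: break the task down
    for subtask in subtasks:
        for _ in range(max_steps):
            screenshot = controller.capture()      # observe the current screen
            action = vlm.act(subtask, screenshot)  # e.g. a click at (x, y) or typed text
            controller.execute(action)             # execution: send the action over VNC
            result = controller.capture()
            if vlm.reflect(subtask, result) == "success":   # reflection
                break                              # subtask done, move on
            # otherwise the VLM retries, using the new screenshot as feedback
```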
Quick Start & Requirements
pip install -r client/requirements.txt
Docker image: niuniushan/screenagent-env (provides the controlled desktop environment).
Python 3.10+ is recommended. A VLM (GPT-4V, LLaVA-1.5, CogAgent, or ScreenAgent) is required, with setup instructions provided for local models. Clipboard service setup is recommended for non-ASCII input.
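To give a sense of what action execution over VNC looks like, here is a minimal sketch using the third-party vncdotool library; the host, port, and password are placeholders, and the project's own controller may handle the protocol differently.

```python
# Minimal sketch of driving a desktop over VNC with the third-party vncdotool
# library; host, port, and password are placeholders.
from vncdotool import api

client = api.connect("localhost::5900", password="screenagent")

client.captureScreen("before.png")   # screenshot for the VLM to interpret
client.mouseMove(400, 300)           # move to coordinates chosen by the VLM
client.mousePress(1)                 # left-click at that position
client.keyPress("enter")            # press a key, e.g. to confirm a dialog
client.captureScreen("after.png")    # capture the result for reflection

client.disconnect()
```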
Maintenance & Community
The project was accepted to IJCAI 2024. The README does not list specific community channels or active maintainers.
Licensing & Compatibility
Limitations & Caveats
The setup process for local VLM inferencers is complex and resource-intensive. The project is research-oriented, and the "TODO" list indicates areas for future development, such as simplifying the controller and integrating with Gym.