Multimodal AI agent for web UI interaction
This project provides an AI agent that leverages GPT-4V(ision) to interact with web UIs using mouse and keyboard actions. It aims to automate workflows, improve UI accessibility, and enable automated UI testing by mirroring human interaction patterns.
How It Works
The agent combines GPT-4V(ision) with a "Set-of-Mark Prompting" technique and a custom auto-labeler. The auto-labeler assigns a unique numerical ID to each interactable UI element. Given a task description and a screenshot, GPT-4V determines the next action and refers to its target by element ID; that ID is then used to specify the precise coordinates for the mouse click or typing.
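The ID-to-coordinate step described above can be illustrated with a minimal sketch. It assumes each labeled element is stored with a pixel bounding box and that the model's reply has already been parsed into an action object. The names `Box`, `LabeledElement`, `ModelAction`, `labelElements`, and `resolveTarget` are hypothetical and do not come from the project's code.

```typescript
// Hypothetical shapes for illustration; the project's real data structures may differ.
interface Box {
  x: number;      // left edge in pixels
  y: number;      // top edge in pixels
  width: number;
  height: number;
}

interface LabeledElement extends Box {
  id: number;     // numeric mark drawn on the screenshot
}

// Assumed form of an action after parsing the model's reply.
type ModelAction =
  | { kind: "click"; elementId: number }
  | { kind: "type"; elementId: number; text: string };

// Assign sequential numeric IDs (starting at 1) to interactable element boxes.
function labelElements(boxes: Box[]): LabeledElement[] {
  return boxes.map((box, index) => ({ id: index + 1, ...box }));
}

// Map the element ID named in an action to the center point of that element's box.
function resolveTarget(
  action: ModelAction,
  elements: LabeledElement[],
): { x: number; y: number } {
  const target = elements.find((el) => el.id === action.elementId);
  if (target === undefined) {
    throw new Error(`No labeled element with id ${action.elementId}`);
  }
  return {
    x: target.x + target.width / 2,
    y: target.y + target.height / 2,
  };
}
```

In this sketch the model never has to output raw coordinates: it only names an ID, and an unknown ID fails loudly instead of clicking an arbitrary point.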
Quick Start & Requirements
npm install
npm start
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project currently has partial support for typing special keycodes and scrolling. Features like AI auto-labeling, remembering information, and prompting the user for more information are not yet implemented.