GPT-4V-Act by ddupont808

Multimodal AI agent for web UI interaction

Created 1 year ago · 1,046 stars · Top 36.6% on sourcepulse

Project Summary

This project provides an AI agent that leverages GPT-4V(ision) to interact with web UIs using mouse and keyboard actions. It aims to automate workflows, improve UI accessibility, and enable automated UI testing by mirroring human interaction patterns.

How It Works

The agent combines GPT-4V(ision) with a "Set-of-Mark Prompting" technique and a custom auto-labeler. The auto-labeler assigns a unique numerical ID to each interactable UI element. Given a task description and a labeled screenshot, GPT-4V decides on the next action and refers to the target element by its ID, which the labeler maps back to precise screen coordinates for mouse clicks or typing.
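A minimal sketch of what such a DOM auto-labeler could look like follows. This is illustrative only and assumes a browser page context; the selector, badge styling, and function names are assumptions rather than the project's actual code (which also handles details like COCO export):

```typescript
// Illustrative Set-of-Mark style auto-labeler: tag interactable elements with
// numeric badges and remember their bounding boxes. Names and selector are
// assumptions; the project's real labeler is more thorough.
interface LabeledElement {
  id: number;        // numerical ID visible in the screenshot sent to GPT-4V
  element: Element;  // the underlying interactable DOM node
  rect: DOMRect;     // bounding box used to resolve the ID to coordinates
}

function labelInteractableElements(): LabeledElement[] {
  const selector = 'a, button, input, textarea, select, [role="button"]';
  const labeled: LabeledElement[] = [];
  let nextId = 0;

  for (const element of Array.from(document.querySelectorAll(selector))) {
    const rect = element.getBoundingClientRect();
    if (rect.width === 0 || rect.height === 0) continue; // skip invisible nodes

    // Overlay a small numeric badge so the ID shows up in the screenshot.
    const badge = document.createElement('div');
    badge.textContent = String(nextId);
    badge.style.cssText =
      `position: fixed; left: ${rect.left}px; top: ${rect.top}px; ` +
      'background: yellow; color: black; font-size: 10px; z-index: 99999;';
    document.body.appendChild(badge);

    labeled.push({ id: nextId, element, rect });
    nextId++;
  }
  return labeled;
}

// An ID chosen by the model then resolves to click coordinates:
function centerOf(label: LabeledElement): { x: number; y: number } {
  return {
    x: label.rect.left + label.rect.width / 2,
    y: label.rect.top + label.rect.height / 2,
  };
}
```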

Quick Start & Requirements

  • Install dependencies: npm install
  • Start the demo: npm start
  • Requires Node.js (a combined command sequence is sketched below).
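Put together, a fresh setup looks roughly like this; the repository URL is an assumption inferred from the project and author names, not stated in this summary:

```
# repository URL assumed from the project and author names
git clone https://github.com/ddupont808/GPT-4V-Act
cd GPT-4V-Act
npm install   # install dependencies
npm start     # launch the demo
```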

Highlighted Details

  • JS DOM auto-labeler with COCO export.
  • Supports clicking and typing characters, numbers, and strings (special keycode support is only partial).
  • Demonstrates interaction via JSON-formatted action outputs (an illustrative example follows this list).
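
The exact action schema is not documented in this summary; a hypothetical output in that spirit, with assumed field names, might look like this:

```typescript
// Hypothetical shape of a JSON-formatted action; field names are assumptions,
// not the project's documented schema. The numeric element ID refers to a
// label produced by the auto-labeler.
type AgentAction =
  | { action: 'click'; element: number }
  | { action: 'type'; element: number; text: string };

const example: AgentAction = { action: 'type', element: 12, text: 'weather in Boston' };
console.log(JSON.stringify(example));
// => {"action":"type","element":12,"text":"weather in Boston"}
```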

Maintenance & Community

  • Project lead: ddupont808.
  • Contact: ddupont@mit.edu.
  • Mentions a new project, Windows Agent Arena (WAA), incorporating similar features.

Licensing & Compatibility

  • License not specified in the README.

Limitations & Caveats

The project currently has partial support for typing special keycodes and scrolling. Features like AI auto-labeling, remembering information, and prompting the user for more information are not yet implemented.

Health Check
  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 14 stars in the last 90 days
