GPT-4V-Act by ddupont808

Multimodal AI agent for web UI interaction

Created 2 years ago

1,066 stars

Top 35.5% on SourcePulse


1 Expert Loves This Project

Jianwei Yang

Research Scientist at Meta Superintelligence Lab

Project Summary

This project provides an AI agent that leverages GPT-4V(ision) to interact with web UIs using mouse and keyboard actions. It aims to automate workflows, improve UI accessibility, and enable automated UI testing by mirroring human interaction patterns.

How It Works

The agent combines GPT-4V(ision) with a "Set-of-Mark Prompting" technique and a custom auto-labeler. The auto-labeler assigns a unique numerical ID to each interactable UI element. Given a task description and a labeled screenshot, GPT-4V determines the next action and refers to its target by element ID, which is then used to specify the precise coordinates for a mouse click or for typing.
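
As a rough illustration of the last step, the sketch below resolves an element ID to a click point. It is a minimal sketch, not the project's actual code: the `labeledElements` shape, the `resolveClickPoint` helper, and the choice of the bounding-box center as the click target are all assumptions made for this example.

```javascript
// Hypothetical output of an auto-labeler: each interactable element gets
// a numeric ID and a bounding box in screen pixels. (Assumed shape.)
const labeledElements = [
  { id: 1, x: 10, y: 20, width: 100, height: 30 },
  { id: 2, x: 200, y: 50, width: 80, height: 40 },
];

// Resolve an element ID chosen by the model to a concrete click point.
// Assumption: clicking the center of the bounding box is good enough.
function resolveClickPoint(elements, id) {
  const el = elements.find((e) => e.id === id);
  if (!el) {
    throw new Error(`Unknown element ID: ${id}`);
  }
  return { x: el.x + el.width / 2, y: el.y + el.height / 2 };
}

const point = resolveClickPoint(labeledElements, 2);
console.log(point); // { x: 240, y: 70 }
```

Keeping the model's output limited to an ID means it never has to estimate pixel positions itself; the lookup table carries that information.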

Quick Start & Requirements

Install dependencies: npm install
Start the demo: npm start
Requires Node.js.

Highlighted Details

JS DOM auto-labeler with COCO export.
Supports clicking and typing of characters, numbers, and strings; typing support is otherwise partial.
Demonstrates interaction via JSON-formatted action outputs.
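
To make the JSON-formatted action output more concrete, here is a minimal sketch of how such an action might be parsed and validated before it is executed. The schema (`action`, `element`, `text` fields) and the `parseAction` helper are hypothetical and chosen only for illustration; the project's real output format may differ.

```javascript
// Assumed action schema (hypothetical, for illustration only):
//   { "action": "click", "element": <id> }
//   { "action": "type",  "element": <id>, "text": "<string>" }
function parseAction(raw) {
  const action = JSON.parse(raw); // throws SyntaxError on malformed JSON
  if (action === null || typeof action !== "object") {
    throw new Error("Action must be a JSON object");
  }
  if (!Number.isInteger(action.element)) {
    throw new Error("Action must reference a numeric element ID");
  }
  switch (action.action) {
    case "click":
      return { kind: "click", element: action.element };
    case "type":
      if (typeof action.text !== "string") {
        throw new Error("Type action needs a text string");
      }
      return { kind: "type", element: action.element, text: action.text };
    default:
      throw new Error(`Unsupported action: ${action.action}`);
  }
}

const parsed = parseAction('{"action": "type", "element": 5, "text": "hello"}');
console.log(parsed); // { kind: 'type', element: 5, text: 'hello' }
```

Validating the model's output before acting on it lets the agent reject malformed or unsupported actions instead of sending stray input to the page.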

Maintenance & Community

Project lead: ddupont808.
Contact: ddupont@mit.edu.
The README mentions a newer project, Windows Agent Arena (WAA), that incorporates similar features.

Licensing & Compatibility

License not specified in the README.

Limitations & Caveats

The project currently has partial support for typing special keycodes and scrolling. Features like AI auto-labeling, remembering information, and prompting the user for more information are not yet implemented.

Health Check

Last commit: 1 year ago
Responsiveness: Inactive

Star History

7 stars in the last 30 days