Browser automation via GPT-4 Vision
This project provides a browser automation tool that leverages GPT-4 Vision to interpret user actions and generate automation scripts. It targets developers and power users seeking to create complex browser workflows through intuitive, human-like instruction, aiming to simplify and enhance web automation tasks.
How It Works
The core innovation addresses element selection by indexing the entire DOM in MeiliSearch. GPT-4 Vision generates commands (e.g., "click this text"), which are then used to query the MeiliSearch index for the corresponding element ID. This approach aims for greater reliability than methods relying solely on visual coordinates or raw HTML. For workflow adherence, it employs an "Actions Augmented Generation" technique, embedding recorded DOM element changes from user actions within prompts to keep GPT focused on the task.
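The two ideas above can be illustrated with a minimal sketch. It is not the project's code: a tiny in-memory word-match search stands in for MeiliSearch so the example runs on its own, and every name in it (`IndexedElement`, `RecordedAction`, `resolveElementId`, `buildPrompt`, the prompt wording) is a hypothetical chosen for illustration.

```typescript
// Hypothetical shape of one indexed DOM element; field names are assumptions.
interface IndexedElement {
  id: string;   // element ID assigned when the DOM is indexed
  tag: string;  // e.g. "button", "a"
  text: string; // visible text content
}

// Count how many words of the query appear in the element text.
// A real search engine such as MeiliSearch would add typo tolerance and ranking.
function score(query: string, text: string): number {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  const haystack = text.toLowerCase();
  return words.filter((w) => haystack.includes(w)).length;
}

// Resolve the text named in a model command to the best-matching element ID.
// Returns null when no element matches any query word.
function resolveElementId(index: IndexedElement[], query: string): string | null {
  let best: IndexedElement | null = null;
  let bestScore = 0;
  for (const el of index) {
    const s = score(query, el.text);
    if (s > bestScore) {
      best = el;
      bestScore = s;
    }
  }
  return best ? best.id : null;
}

// Hypothetical record of a step captured while the user demonstrated the workflow.
interface RecordedAction {
  action: string;      // e.g. "click", "type"
  elementText: string; // text of the element the user acted on
}

// Embed the recorded actions in the prompt so the model stays on the task.
function buildPrompt(goal: string, recorded: RecordedAction[]): string {
  const steps = recorded
    .map((r, i) => `${i + 1}. ${r.action} "${r.elementText}"`)
    .join("\n");
  return `Goal: ${goal}\nRecorded user actions:\n${steps}\nReturn the next command.`;
}

// Sample index used for the usage note below.
const index: IndexedElement[] = [
  { id: "el-1", tag: "a", text: "Sign in" },
  { id: "el-2", tag: "button", text: "Create new project" },
  { id: "el-3", tag: "a", text: "Pricing" },
];
```

With this sample index, `resolveElementId(index, "new project")` resolves to the button's ID, and a query that matches nothing returns `null`, which a caller could treat as a signal to re-prompt the model. Looking elements up by text rather than by screen coordinates is what lets the command survive layout changes, though duplicate text would still need a tie-breaking rule.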
Quick Start & Requirements
Setup involves a firebaseAdmin/cert/dev.json (or prod.json) file, .env file setup, npm install, npm run db:deploy, and then npm run dev (development) or npm run build & npm run start (production).
Extension build path: ./client/extension/build.
Highlighted Details
Maintenance & Community
The project is maintained by vignshwarar. Further community or roadmap details are not explicitly provided in the README.
Licensing & Compatibility
The README does not specify a license. Compatibility for commercial use or closed-source linking is undetermined.
Limitations & Caveats
The project is described as being in active development, with scrolling, opening new tabs, and loop support still on the roadmap. Handling icons and duplicate text elements is noted as an ongoing challenge.