vimGPT  by ishan0102

GPT-4V web browser using Vimium

created 1 year ago
2,668 stars

Top 18.1% on sourcepulse

Project Summary

This project lets multimodal LLMs, specifically GPT-4V, browse the web by using the Vimium Chrome extension for keyboard-based navigation. It targets developers and researchers interested in AI-driven web interaction, and offers an alternative to passing the page's DOM to the model as text: the model works from screenshots instead.

How It Works

The core idea is to combine GPT-4V's visual understanding with Vimium's keyboard shortcuts. With Vimium active, pressing 'f' overlays a short key label on each clickable element. GPT-4V analyzes a screenshot of the page and picks the element to interact with, and the script then sends the matching keys to activate it. The model can therefore act on web elements without a textual DOM representation, relying solely on visual cues.
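
The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: `run_step`, the three injected callables, and the action shape (`{"click": ...}` or `{"done": True}`) are hypothetical names chosen for the example. The browser and model calls are passed in as plain functions so the control flow can be read and exercised without a browser or an API key.

```python
from typing import Callable, Dict, List


def run_step(
    take_screenshot: Callable[[], bytes],
    ask_model: Callable[[bytes, str], Dict],
    press_keys: Callable[[List[str]], None],
    objective: str,
) -> bool:
    """Run one observe-decide-act step; return True when the model reports it is done."""
    # Press 'f' so Vimium overlays its hint labels before the screenshot is taken.
    press_keys(["f"])
    image = take_screenshot()

    # Hypothetical reply shape: {"click": "SA"} or {"done": True}.
    action = ask_model(image, objective)

    if action.get("done"):
        press_keys(["Escape"])  # leave hint mode
        return True
    if "click" in action:
        # Typing the hint label activates the matching element.
        # Lowercased on the assumption that hints are typed without Shift.
        press_keys(list(action["click"].lower()))
    else:
        press_keys(["Escape"])  # unrecognized action: just dismiss the hints
    return False
```

In a real run, `press_keys` would forward to the browser automation layer and `ask_model` would send the screenshot to the vision model; here they are left abstract on purpose.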

Quick Start & Requirements

  • Install Python requirements: pip install -r requirements.txt
  • Download Vimium locally and load it manually in Playwright.
  • Run the script: python main.py
  • Voice Mode: python main.py --voice
  • Requires Python 3.x, Chrome browser, and Playwright.
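
For the "load it manually in Playwright" step, one common approach is to start a persistent Chromium context with the two Chromium flags that point at an unpacked extension directory. The sketch below assumes Vimium has been downloaded to `./vimium-master`; that path and the helper names `extension_args` and `launch_with_vimium` are assumptions for illustration, and the project's own script may do this differently.

```python
from pathlib import Path
from typing import List


def extension_args(extension_dir: str) -> List[str]:
    """Build the Chromium flags that load a single unpacked extension."""
    path = str(Path(extension_dir).resolve())
    return [
        f"--disable-extensions-except={path}",
        f"--load-extension={path}",
    ]


def launch_with_vimium(extension_dir: str = "./vimium-master"):
    """Start Chromium with the extension loaded; the caller closes both returned objects."""
    # Imported lazily so extension_args() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    playwright = sync_playwright().start()
    context = playwright.chromium.launch_persistent_context(
        "",  # empty string: Playwright uses a temporary profile directory
        headless=False,  # extensions have traditionally required a headed browser
        args=extension_args(extension_dir),
    )
    return playwright, context
```

When finished, call `context.close()` and then `playwright.stop()` on the two returned objects.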

Highlighted Details

  • Utilizes GPT-4V's vision capabilities for web browsing.
  • Integrates with the Vimium Chrome extension for keyboard navigation.
  • Supports voice commands for hands-free interaction.
  • Explores using accessibility trees and visual DOM element labeling for improved interaction.

Maintenance & Community

The project is maintained by ishan0102. The README's shoutouts and references (Hacker News, VisualWebArena, WIRED, globot, and nat/natbot) point to community interest and related work.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or closed-source integration.

Limitations & Caveats

The project is experimental and may struggle with low-resolution images or complex page layouts. Planned future work targets the limitations that stem from the Vision API's lack of JSON mode and function calling.
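
Without JSON mode, a model reply may wrap its structured answer in extra prose. One way to cope, shown here as a minimal sketch and not as the project's actual parser, is to scan the reply for the first substring that decodes as a JSON object. The function name `extract_json` and the example reply shapes are hypothetical.

```python
import json


def extract_json(reply: str) -> dict:
    """Return the first JSON object embedded in a free-form model reply."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(reply):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `index`
            # and ignores any text that follows it.
            obj, _ = decoder.raw_decode(reply, index)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
        return obj
    raise ValueError("no JSON object found in model reply")
```

For example, `extract_json('Sure! {"click": "SA"} Let me know.')` returns `{"click": "SA"}`, while a reply with no JSON object raises `ValueError` so the caller can retry the request.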

Health Check

  • Last commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 90 days
