vimGPT by ishan0102

GPT-4V web browser using Vimium

Created 1 year ago
2,669 stars

Top 17.6% on SourcePulse

Project Summary

This project lets multimodal LLMs, specifically GPT-4V, browse the web by using the Vimium Chrome extension for keyboard-based navigation. It targets developers and researchers interested in AI-driven web interaction, and offers a way to tell a vision-only model what it can click without feeding it the page's DOM text.

How It Works

The core idea is to combine GPT-4V's visual understanding with Vimium's keyboard shortcuts. GPT-4V analyzes screenshots of web pages and identifies the elements to interact with. Vimium makes those elements reachable by keyboard: pressing 'f' overlays a short label on each clickable element, and typing a label's characters activates that element. The model can therefore interact with a page without a textual DOM representation, relying solely on visual cues.
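
The translation step can be illustrated with a short sketch. This is not the project's actual code: the action dictionary shape (the "click" and "type" keys) and the function name action_to_keys are assumptions made for this example. It shows only how a model-chosen action might become the keystrokes a browser driver would send once Vimium's labels are on screen.

```python
from typing import Dict, List


def action_to_keys(action: Dict[str, str]) -> List[str]:
    """Turn a hypothetical model action into an ordered list of keystrokes.

    Assumed action shapes (illustrative only, not the project's format):
      {"click": "AB"}  -> type the label Vimium shows for the target element
      {"type": "text"} -> type the text, then press Enter
    """
    if "click" in action:
        # The labels are assumed to be visible already (shown with 'f'
        # before the screenshot was taken), so a click is just the label.
        return list(action["click"])
    if "type" in action:
        # Typing into the focused field, followed by Enter to submit.
        return list(action["type"]) + ["Enter"]
    raise ValueError(f"unsupported action: {action!r}")
```

A driver loop would then send each returned key to the browser, take a fresh screenshot, and ask the model for the next action.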

Quick Start & Requirements

  • Install Python requirements: pip install -r requirements.txt
  • Download Vimium locally and load it manually in Playwright.
  • Run the script: python main.py
  • Voice Mode: python main.py --voice
  • Requires Python 3.x, Chrome browser, and Playwright.
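
Loading a local extension in Playwright is usually done through a persistent Chromium context started with two command-line flags. The sketch below shows that pattern under stated assumptions: the directory name ./vimium-master and the helper names extension_args and launch_with_vimium are hypothetical, and the project's own setup may differ.

```python
from pathlib import Path
from typing import List


def extension_args(extension_dir: str) -> List[str]:
    """Build the Chromium flags that load one unpacked extension."""
    path = str(Path(extension_dir).resolve())
    return [
        f"--disable-extensions-except={path}",
        f"--load-extension={path}",
    ]


def launch_with_vimium(extension_dir: str = "./vimium-master"):
    """Start Chromium with the extension loaded (directory name is assumed)."""
    # Imported lazily so extension_args() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    playwright = sync_playwright().start()
    context = playwright.chromium.launch_persistent_context(
        "",  # empty string asks Playwright for a temporary profile directory
        headless=False,  # run with a visible browser window
        args=extension_args(extension_dir),
    )
    # The caller is responsible for context.close() and playwright.stop().
    return playwright, context
```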

Highlighted Details

  • Utilizes GPT-4V's vision capabilities for web browsing.
  • Integrates with the Vimium Chrome extension for keyboard navigation.
  • Supports voice commands for hands-free interaction.
  • Explores using accessibility trees and visual DOM element labeling for improved interaction.

Maintenance & Community

The project is maintained by ishan0102. The README's shoutouts and references (Hacker News, VisualWebArena, WIRED, globot, and nat/natbot) point to community interest and related work.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or closed-source integration.

Limitations & Caveats

The project is experimental and may struggle with low-resolution images or complex web page layouts. Planned future work targets the Vision API's lack of JSON mode and function calling.
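
Without a JSON mode, a model reply may wrap its answer in prose or a code fence. One common workaround, shown here as a sketch rather than the project's approach, is to scan the reply for the first parsable JSON object. The function name extract_json is hypothetical.

```python
import json
from typing import Any, Dict, Optional


def extract_json(reply: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object found in a free-form reply, or None."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores any trailing text after it.
            obj, _ = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace; try the next one.
        start = reply.find("{", start + 1)
    return None
```

A caller would treat a None result as a failed step and could re-prompt the model or retry with a new screenshot.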

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days
