GPT-4V web browser using Vimium
This project enables multimodal LLMs, specifically GPT-4V, to browse the web by leveraging the Vimium Chrome extension for keyboard-based navigation. It targets developers and researchers interested in AI-driven web interaction, offering a visual alternative to feeding raw DOM text to the model.
How It Works
The core idea is to combine GPT-4V's visual understanding with Vimium's keyboard shortcuts. GPT-4V analyzes screenshots of web pages and identifies elements to interact with. The model then issues Vimium keyboard commands (e.g., pressing 'f' to reveal hint labels over clickable links, then typing a hint's letters to activate one). This lets the model interact with web elements without a textual DOM representation, relying solely on visual cues.
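The loop above can be sketched in a few pure functions. This is a minimal illustration, not the project's actual code: the prompt wording and the `ask_gpt4v` call are hypothetical, and only the hint-parsing step is concrete (Vimium hints are short runs of capital letters).

```python
import re
from typing import Optional

# Vimium renders hints as 1-3 capital letters (e.g. "F", "AB").
VIMIUM_HINT = re.compile(r"\b([A-Z]{1,3})\b")

def build_prompt(objective: str) -> str:
    """Hypothetical prompt asking GPT-4V to pick a Vimium hint from a screenshot."""
    return (
        "You are controlling a browser through the Vimium extension. "
        "The screenshot shows yellow hint labels next to each clickable "
        f"element. Objective: {objective}. "
        "Reply with only the hint letters of the element to click."
    )

def parse_hint(reply: str) -> Optional[str]:
    """Extract the first Vimium-style hint from the model's reply."""
    m = VIMIUM_HINT.search(reply)
    return m.group(1) if m else None

# Sketch of the driving loop (assumes a browser-automation `page` object
# and an `ask_gpt4v` helper wrapping the Vision API; both are assumptions):
#
#   page.keyboard.press("f")            # ask Vimium to show link hints
#   page.screenshot(path="page.png")    # capture the hinted page
#   reply = ask_gpt4v("page.png", build_prompt(objective))
#   hint = parse_hint(reply)
#   for key in hint or "":
#       page.keyboard.press(key)        # type the hint to activate the link
```

The regex is intentionally strict: lowercase prose around the hint (e.g. "Click AB") is ignored, so a chatty model reply still yields a usable keystroke sequence.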
Quick Start & Requirements
pip install -r requirements.txt
python main.py
python main.py --voice
Maintenance & Community
The project is maintained by ishan0102. References to HackerNews, VisualWebArena, WIRED, globot, and nat/natbot point to community interest and related work.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or closed-source integration.
Limitations & Caveats
The project is experimental and may struggle with low-resolution screenshots or complex page layouts. Future work is planned to work around the Vision API's current lack of JSON mode and function calling.
Last activity: 10 months ago (inactive).