GPT-4V web browser using Vimium
This project enables multimodal LLMs, specifically GPT-4V, to browse the web by leveraging the Vimium Chrome extension for keyboard-based navigation. It targets developers and researchers interested in AI-driven web interaction, offering a visual alternative to feeding raw DOM text to the model.
How It Works
The core idea is to combine GPT-4V's visual understanding with Vimium's keyboard shortcuts. GPT-4V analyzes screenshots of web pages and identifies elements to interact with. The model then issues Vimium keyboard commands (e.g., pressing 'f' to reveal hint labels over clickable links, then typing a hint's letters to activate one). This lets the model interact with web elements without a textual DOM representation, relying solely on visual cues.
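The loop above can be sketched in a few pure functions. This is a minimal illustration, not the project's actual code: the prompt wording and the `ask_gpt4v` call are hypothetical, and only the hint-parsing step is concrete (Vimium hints are short runs of capital letters).

```python
import re
from typing import Optional

# Vimium renders hints as 1-3 capital letters (e.g. "F", "AB").
VIMIUM_HINT = re.compile(r"\b([A-Z]{1,3})\b")

def build_prompt(objective: str) -> str:
    """Hypothetical prompt asking GPT-4V to pick a Vimium hint from a screenshot."""
    return (
        "You are controlling a browser through the Vimium extension. "
        "The screenshot shows yellow hint labels next to each clickable "
        f"element. Objective: {objective}. "
        "Reply with only the hint letters of the element to click."
    )

def parse_hint(reply: str) -> Optional[str]:
    """Extract the first Vimium-style hint from the model's reply."""
    m = VIMIUM_HINT.search(reply)
    return m.group(1) if m else None

# Sketch of the driving loop (assumes a browser-automation `page` object
# and an `ask_gpt4v` helper wrapping the Vision API; both are assumptions):
#
#   page.keyboard.press("f")            # ask Vimium to show link hints
#   page.screenshot(path="page.png")    # capture the hinted page
#   reply = ask_gpt4v("page.png", build_prompt(objective))
#   hint = parse_hint(reply)
#   for key in hint or "":
#       page.keyboard.press(key)        # type the hint to activate the link
```

The regex is intentionally strict: lowercase prose around the hint (e.g. "Click AB") is ignored, so a chatty model reply still yields a usable keystroke sequence.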
Quick Start & Requirements
pip install -r requirements.txt
python main.py
python main.py --voice
Maintenance & Community
The project is maintained by ishan0102. References to HackerNews, VisualWebArena, WIRED, globot, and nat/natbot point to community interest and related work.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or closed-source integration.
Limitations & Caveats
The project is experimental and may struggle with low-resolution screenshots or complex page layouts. Future work is planned to work around the Vision API's current lack of JSON mode and function calling.
Last activity: 10 months ago (inactive).