gemini-skill by WJZ-P

Gemini web automation for generative AI

Created 4 months ago

828 stars

Top 42.1% on SourcePulse

Project Summary

This project provides a Node.js tool for driving Gemini's web interface programmatically, covering AI image generation, text conversations, and image extraction. It targets AI agents and developers who want to integrate Gemini's capabilities into their workflows through the MCP (Model Context Protocol) standard.

How It Works

The core architecture is built around a Daemon that manages a persistent browser instance, reached over the Chrome DevTools Protocol (CDP). The Daemon is launched automatically on demand and includes stealth plugins to bypass anti-bot detection. Responsibilities are split across four components: an MCP server for protocol handling, the Gemini operation logic, a browser connector, and the Daemon for process management. This design reuses one browser instance across requests, launches it only when needed, and shuts it down automatically after 30 minutes of inactivity.
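
The launch-on-demand and idle-shutdown behaviour described above can be sketched roughly as follows. This is an illustrative pattern only, not the project's code: the class name LazyInstance and the launch/close callbacks are hypothetical stand-ins for real browser start and stop logic, and the clock is injectable so the logic can be checked without waiting.

```javascript
// Sketch of "launch on demand, reuse, shut down when idle".
// LazyInstance, launch, and close are hypothetical names, not the project's API.
class LazyInstance {
  constructor({ launch, close, idleMs, now = Date.now }) {
    this.launch = launch;   // creates the shared instance (e.g. starts a browser)
    this.close = close;     // disposes of the instance
    this.idleMs = idleMs;   // allowed idle time before shutdown
    this.now = now;         // injectable clock, returns milliseconds
    this.instance = null;
    this.lastUsed = 0;
  }

  // Return the shared instance, launching it only on first use.
  acquire() {
    if (this.instance === null) {
      this.instance = this.launch();
    }
    this.lastUsed = this.now();
    return this.instance;
  }

  // Meant to be called periodically; closes the instance once it has been
  // idle for at least idleMs. Returns true if a shutdown happened.
  reapIfIdle() {
    if (this.instance === null) return false;
    if (this.now() - this.lastUsed < this.idleMs) return false;
    this.close(this.instance);
    this.instance = null;
    return true;
  }
}

// The 30-minute inactivity window mentioned in the text.
const THIRTY_MINUTES_MS = 30 * 60 * 1000;
```

In a real process, reapIfIdle() would typically be driven by a periodic timer, and every incoming request would go through acquire() so that activity resets the idle window.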

Quick Start & Requirements

Prerequisites: Node.js version 18 or higher, a compatible browser (Chrome, Edge, Chromium) installed and logged into a Google account for Gemini access.
Installation: Clone the repository (git clone), navigate into the directory (cd gemini-skill), and install dependencies (npm install).
Configuration: Environment variables or a .env file in the project root can configure browser paths, headless mode, ports, and output directories.
Running: The primary method is npm run mcp to start the MCP server. Alternatively, npm run daemon starts only the Daemon, or npm run demo executes example usage.
Documentation: The README serves as the primary documentation.
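
As a rough illustration of the Configuration step, the sketch below reads settings from environment variables and applies defaults. The variable names (BROWSER_PATH, HEADLESS, CDP_PORT, OUTPUT_DIR) and the default values are assumptions made for this example; the README is the authority on the names the project actually reads.

```javascript
// Hypothetical config loader. BROWSER_PATH, HEADLESS, CDP_PORT, and OUTPUT_DIR
// are placeholder names chosen for illustration, not the project's settings.
function loadConfig(env = process.env) {
  // 9222 is only an assumed default for this sketch.
  const port = Number(env.CDP_PORT ?? "9222");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid CDP_PORT: ${env.CDP_PORT}`);
  }
  return {
    browserPath: env.BROWSER_PATH || null, // null: leave detection to the launcher
    headless: (env.HEADLESS ?? "false").toLowerCase() === "true",
    cdpPort: port,
    outputDir: env.OUTPUT_DIR || "./output",
  };
}
```

Validating the port at load time means a mistyped value fails immediately instead of surfacing later as a connection error.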

Highlighted Details

AI Image Generation: Generates images from text prompts, supporting high-resolution downloads and the use of reference images.
Image Extraction & Watermark Removal: Extracts images from Gemini conversations and automatically removes watermarks from downloaded images.
MCP Server: Exposes a standard MCP interface, allowing seamless integration with any MCP-compatible AI agent.
Session Management: Facilitates creating new sessions, switching between Gemini models (pro, quick, think), and navigating to historical conversations.
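
To make the MCP Server item concrete: MCP messages are JSON-RPC 2.0, and a client invokes a tool with the tools/call method. The sketch below builds such a request. The tool name generate_image and its prompt argument are hypothetical; the real tool names and parameters are whatever the server advertises in response to tools/list.

```javascript
// Builds a JSON-RPC 2.0 "tools/call" request as used by MCP clients.
// The tool name and arguments passed below are illustrative assumptions.
let nextId = 1;

function buildToolCall(toolName, args) {
  return {
    jsonrpc: "2.0",
    id: nextId++, // each request carries a unique id so replies can be matched
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

const request = buildToolCall("generate_image", { prompt: "a lighthouse at dusk" });
console.log(JSON.stringify(request));
```

In practice an MCP-compatible agent or client library produces these messages itself, so hand-building them is mainly useful for debugging.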

Maintenance & Community

The project includes a "To Do List" indicating ongoing development, with planned features like multi-browser instance support and video/music generation. No specific details on maintainers, sponsorships, or community channels (like Discord or Slack) are provided in the README.

Licensing & Compatibility

The project is licensed under the MIT License, which is permissive for commercial use and integration. It explicitly mentions support for the LINUX DO community.

Limitations & Caveats

Initial setup requires a manual Google account login within the launched browser instance. Image generation can be time-consuming (60-120 seconds), necessitating appropriately configured timeouts in client applications. The current implementation does not support running multiple instances concurrently on the same CDP port. Support for music and video generation is pending.
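
Because image generation may take 60-120 seconds, a client can wrap its call in an explicit timeout rather than rely on a shorter library default. The following is a generic sketch: withTimeout is a hypothetical helper, and the 30-second margin on top of the 120-second upper bound is an arbitrary assumption.

```javascript
// Hypothetical helper: rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = "operation") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins; the timer is cleared in either case.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Upper bound from the text (120 s) plus an assumed 30 s safety margin.
const IMAGE_TIMEOUT_MS = 120_000 + 30_000;

// Usage, where callGenerateImage stands in for whatever issues the request:
// const result = await withTimeout(callGenerateImage(prompt), IMAGE_TIMEOUT_MS, "image generation");
```

Note that the timeout only stops the client from waiting; it does not cancel work already in progress on the server side.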

Health Check

Last Commit

4 weeks ago

Responsiveness

Inactive

Star History

13 stars in the last 30 days