screen.vision by r-muresan

AI-powered interactive screen guidance system

Created 7 months ago

307 stars

Top 87.2% on SourcePulse

Project Summary

Screen.vision provides an AI-powered assistant that guides users through complex on-screen tasks by analyzing their screen in real time and delivering step-by-step instructions. It targets users who need help with software configuration, setup, or any other digital workflow, and offers a more intuitive alternative to static tutorials.

How It Works

The system captures the user's screen via browser APIs and splits the work across three models. A primary LLM (GPT-5.2) generates sequential instructions, a vision model (Qwen3-VL) locates the specific UI element each instruction refers to, and a third model (Gemini 3 Flash) verifies task completion by comparing the screen state before and after an action. The pipeline is designed to give users one concise, actionable step at a time rather than a full tutorial, with processing occurring in real time.
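The three-model flow described above can be sketched as follows. This is a minimal illustration, not the project's code: every name here (`GuidancePipeline`, `plan_next_step`, `locate_element`, `verify_done`, `Step`) is hypothetical, and the model calls are replaced by plain callables so the example runs without API keys.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# All names below are hypothetical; none are taken from the screen.vision codebase.
Screenshot = bytes
BoundingBox = Tuple[int, int, int, int]  # x, y, width, height in pixels


@dataclass
class Step:
    instruction: str
    target: Optional[BoundingBox]  # None if the element could not be located


@dataclass
class GuidancePipeline:
    # Instruction model: (goal, current screen) -> next instruction text
    plan_next_step: Callable[[str, Screenshot], str]
    # Vision model: (instruction, current screen) -> element location, if found
    locate_element: Callable[[str, Screenshot], Optional[BoundingBox]]
    # Verification model: (instruction, before, after) -> was the step completed?
    verify_done: Callable[[str, Screenshot, Screenshot], bool]

    def next_step(self, goal: str, screen: Screenshot) -> Step:
        """Generate one instruction, then try to locate its UI element."""
        instruction = self.plan_next_step(goal, screen)
        return Step(instruction, self.locate_element(instruction, screen))

    def is_complete(self, step: Step, before: Screenshot, after: Screenshot) -> bool:
        """Compare the screen before and after the user's action."""
        return self.verify_done(step.instruction, before, after)


# Stand-in callables; a real deployment would call the hosted models here.
pipeline = GuidancePipeline(
    plan_next_step=lambda goal, screen: "Click the Settings button",
    locate_element=lambda instruction, screen: (40, 12, 96, 32),
    verify_done=lambda instruction, before, after: before != after,
)

step = pipeline.next_step("Enable dark mode", b"frame-1")
print(step.instruction, step.target)
print(pipeline.is_complete(step, b"frame-1", b"frame-2"))
```

Keeping each model behind its own callable mirrors the separation the summary describes: the instruction, localization, and verification stages can be swapped or stubbed independently.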

Quick Start & Requirements

Installation: Clone the repository, then install the frontend dependencies (pnpm install) and the backend dependencies (pip install -r requirements.txt).
Prerequisites: Node.js 18+, Python 3.10+, pnpm. Requires API keys for OpenAI (OPENAI_API_KEY) and OpenRouter (OPENROUTER_API_KEY) configured in a .env.local file.
Running Locally: Execute npm run dev to start the Next.js frontend (port 3000) and FastAPI backend (port 8000).
Docs: GitHub Repository
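Before starting the servers, it can help to confirm that both API keys are present. The sketch below is an assumption-laden helper, not part of the project: the function names are hypothetical, the parser handles only simple KEY=VALUE lines (no `export` prefix, no multi-line values), and how the FastAPI backend actually reads .env.local is not stated in the summary.

```python
from pathlib import Path
from typing import Dict, List

# Key names come from the prerequisites above; everything else is illustrative.
REQUIRED_KEYS = ("OPENAI_API_KEY", "OPENROUTER_API_KEY")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return required keys that are absent or set to an empty string."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def check_env_file(path: str = ".env.local") -> List[str]:
    """Return the required keys missing from the given env file."""
    env_file = Path(path)
    if not env_file.exists():
        return list(REQUIRED_KEYS)
    return missing_keys(parse_env_text(env_file.read_text(encoding="utf-8")))


if __name__ == "__main__":
    missing = check_env_file()
    if missing:
        print("Missing keys in .env.local:", ", ".join(missing))
    else:
        print("All required API keys are set.")
```

Run from the repository root, the script reports which of the two keys still need to be added.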

Highlighted Details

Real-time, interactive screen-sharing assistance for any application.
Combines OpenAI's GPT, Google's Gemini, and Qwen-VL for instruction generation, step verification, and UI element localization, respectively.
Prioritizes user privacy with zero data retention; all processing is ephemeral and occurs in real time.
Frontend utilizes Next.js and React, while the backend is built with FastAPI and Python.

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or project roadmap are provided in the README.

Licensing & Compatibility

The project's license is not specified in the README. This omission prevents an immediate assessment of compatibility for commercial use or integration into closed-source projects.

Limitations & Caveats

Self-hosting requires obtaining and managing API keys for external AI services (OpenAI, OpenRouter), which may incur costs. The lack of a specified license is a significant blocker for determining usage rights. The system's effectiveness depends on how accurately the AI models interpret diverse UI elements and user actions.
