screen.vision  by r-muresan

AI-powered interactive screen guidance system

Created 2 months ago
296 stars

Top 89.7% on SourcePulse

Project Summary

Screen.vision is an AI-powered assistant that guides users through complex on-screen tasks by analyzing their screen in real time and delivering step-by-step instructions. It targets users who need help with software configuration, setup, or other digital workflows, and offers a more interactive alternative to static tutorials.

How It Works

The system captures the user's screen via browser APIs and splits the work across three models. A primary LLM (GPT-5.2) generates sequential instructions, a vision model (Qwen3-VL) locates the specific UI elements each instruction refers to, and a third model (Gemini 3 Flash) verifies task completion by comparing the screen state before and after an action. The pipeline runs in real time and gives users one concise, actionable step at a time to avoid information overload.
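The three roles described above can be sketched as a simple loop. This is a minimal illustration, not the project's actual code: `Guide`, `Step`, `Frame`, `Box`, and the three callables are hypothetical names, and the models are passed in as plain functions so the sketch makes no real API calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# All names here are hypothetical; none come from the screen.vision codebase.
Frame = bytes                    # one captured screenshot, e.g. PNG bytes
Box = Tuple[int, int, int, int]  # x, y, width, height of a UI element


@dataclass
class Step:
    instruction: str
    target: Optional[Box]  # where to point the user, if the element was found


@dataclass
class Guide:
    # Planner role: goal, current frame, completed instructions -> next instruction or None.
    next_instruction: Callable[[str, Frame, list], Optional[str]]
    # Vision role: instruction, current frame -> bounding box or None.
    locate: Callable[[str, Frame], Optional[Box]]
    # Verifier role: instruction, frame before, frame after -> step completed?
    is_done: Callable[[str, Frame, Frame], bool]

    def run(self, goal: str, capture: Callable[[], Frame], max_steps: int = 20) -> list:
        history = []
        for _ in range(max_steps):
            before = capture()
            done = [s.instruction for s in history]
            instruction = self.next_instruction(goal, before, done)
            if instruction is None:  # planner reports that the goal is reached
                break
            step = Step(instruction, self.locate(instruction, before))
            # A real assistant would wait for the user to act here.
            after = capture()
            if self.is_done(instruction, before, after):
                history.append(step)  # otherwise the planner is asked again
        return history
```

Keeping the planner, locator, and verifier behind separate callables mirrors the division of labor the summary describes and lets each model be swapped or stubbed independently.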

Quick Start & Requirements

  • Installation: Clone the repository, then install the frontend dependencies (pnpm install) and the backend dependencies (pip install -r requirements.txt).
  • Prerequisites: Node.js 18+, Python 3.10+, pnpm. Requires API keys for OpenAI (OPENAI_API_KEY) and OpenRouter (OPENROUTER_API_KEY) configured in a .env.local file.
  • Running Locally: Execute npm run dev to start the Next.js frontend (port 3000) and FastAPI backend (port 8000).
  • Docs: GitHub Repository
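Before starting the servers, it can help to confirm that both API keys are present. The following is a small sketch under stated assumptions: it assumes .env.local uses plain KEY=VALUE lines, and `parse_env` and `missing_keys` are hypothetical helpers, not part of the project.

```python
from typing import Dict, List

# Key names taken from the prerequisites above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "OPENROUTER_API_KEY")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str) -> List[str]:
    """Return the required keys that are absent or empty."""
    env = parse_env(text)
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

For example, `missing_keys(open(".env.local").read())` returns an empty list when both keys are set.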

Highlighted Details

  • Real-time, interactive screen-sharing assistance for any application.
  • Combines OpenAI's GPT, Google's Gemini, and Qwen-VL for instruction generation, step verification, and UI element localization, respectively.
  • Prioritizes user privacy with zero data retention; all processing is ephemeral and occurs in real time.
  • Frontend utilizes Next.js and React, while the backend is built with FastAPI and Python.

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), or project roadmap are provided in the README.

Licensing & Compatibility

The project's license is not specified in the README. This omission prevents an immediate assessment of compatibility for commercial use or integration into closed-source projects.

Limitations & Caveats

Self-hosting requires obtaining and managing API keys for external AI services (OpenAI, OpenRouter), which may incur costs. The lack of a specified license is a significant blocker for determining usage rights. The system's effectiveness depends on how accurately the AI models interpret diverse UI elements and user actions.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days
