self-operating-computer by OthersideAI

Framework for multimodal computer operation

created 1 year ago
9,812 stars

Top 5.2% on sourcepulse

Project Summary

This framework lets multimodal AI models operate a computer by interpreting screen content and executing keyboard and mouse actions. It targets AI researchers and developers who want to automate complex digital tasks. By abstracting the user interface into a form that large language models can consume, it offers a different approach to human-computer interaction.

How It Works

The system feeds screen captures to a chosen multimodal model (e.g., GPT-4o, Gemini Pro Vision, Claude 3). The model analyzes the visual input and generates a sequence of actions, such as mouse clicks or keystrokes, to achieve a user-defined objective. This relies on the reasoning and visual understanding capabilities of LLMs to perform tasks that typically require human interaction.
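The capture, analyze, and act cycle described above can be sketched as a small loop. This is a minimal illustration, not the project's actual code: the JSON action format, the "done" operation, and the names Action, parse_actions, and run_objective are all assumptions made for the example. Screen capture, the model call, and input execution are passed in as callables so the sketch stays independent of any particular model API or automation library.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    operation: str  # hypothetical operation names: "click", "write", "done"
    payload: dict   # remaining fields, e.g. coordinates or text to type


def parse_actions(model_reply: str) -> List[Action]:
    """Parse a JSON array of actions (an assumed reply format)."""
    items = json.loads(model_reply)
    return [
        Action(item["operation"],
               {k: v for k, v in item.items() if k != "operation"})
        for item in items
    ]


def run_objective(
    objective: str,
    capture_screen: Callable[[], bytes],
    ask_model: Callable[[str, bytes], str],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> int:
    """Repeat capture -> model -> act until the model says "done"
    or the step budget runs out. Returns the number of actions executed."""
    executed = 0
    for _ in range(max_steps):
        screenshot = capture_screen()             # current screen as image bytes
        reply = ask_model(objective, screenshot)  # model proposes next actions
        for action in parse_actions(reply):
            if action.operation == "done":
                return executed
            execute(action)                       # e.g. move mouse, press keys
            executed += 1
    return executed
```

In a real setup, capture_screen would wrap a screenshot utility, ask_model would call the chosen multimodal model, and execute would drive the mouse and keyboard. The max_steps cap is a design choice for the sketch so that a model that never reports completion cannot loop forever.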

Quick Start & Requirements

  • Install via pip: pip install self-operating-computer
  • Requires an API key for the chosen multimodal model (e.g., OpenAI, Google AI Studio).
  • For LLaVA integration, Ollama must be installed and the LLaVA model pulled (ollama pull llava).
  • For voice mode, Mac users may need to run brew install portaudio; Linux users may need sudo apt install portaudio19-dev python3-pyaudio.
  • OpenAI's gpt-4o model requires prior API spending of at least $5 to unlock.
  • Official Demo: https://github.com/OthersideAI/self-operating-computer/assets/42594239/9e8abc96-c76a-46fb-9b13-03678b3c67e0

Highlighted Details

  • Supports multiple multimodal models including GPT-4o, o1, Gemini Pro Vision, Claude 3, Qwen-VL, and LLaVA.
  • Integrates Optical Character Recognition (OCR) for enhanced element identification and interaction.
  • Features Set-of-Mark (SoM) prompting with a YOLOv8 model for improved visual grounding.
  • Offers voice input capabilities for hands-free operation.
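To illustrate why OCR helps with element identification, the sketch below maps a text label to a click point using OCR bounding boxes. It is a hypothetical example under stated assumptions: the OcrBox structure and find_click_point function are invented for illustration and do not reflect the project's internal API or any specific OCR library's output format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class OcrBox:
    # Assumed shape of one OCR result: recognized text plus its pixel box.
    text: str
    left: int
    top: int
    width: int
    height: int


def find_click_point(boxes: List[OcrBox], label: str) -> Optional[Tuple[int, int]]:
    """Return the center of the first box whose text contains `label`
    (case-insensitive), or None when nothing matches."""
    wanted = label.strip().lower()
    if not wanted:
        return None  # an empty label would otherwise match every box
    for box in boxes:
        if wanted in box.text.lower():
            return (box.left + box.width // 2, box.top + box.height // 2)
    return None
```

With this kind of lookup, the model only has to name the text it wants to click, and the pixel coordinates come from the OCR result instead of the model's own estimate of position.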

Maintenance & Community

  • Active community support via Discord.
  • Contributions are welcomed via pull requests.
  • Updates and developments can be followed on Twitter and LinkedIn via @HyperWriteAI.

Licensing & Compatibility

  • The project appears to be open-source, but a specific license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require license clarification.
  • Compatible with macOS, Windows, and Linux (with X server).

Limitations & Caveats

LLaVA integration via Ollama is considered experimental and is noted to have very high error rates. Access to the gpt-4o model through the OpenAI API requires at least $5 of prior API spending.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 4
  • Issues (30d): 2

Star History

  • 241 stars in the last 90 days
