Framework for multimodal computer operation
Top 5.2% on sourcepulse
This framework enables multimodal AI models to operate a computer by interpreting screen content and executing keyboard and mouse actions, targeting AI researchers and developers seeking to automate complex digital tasks. It offers a novel approach to human-computer interaction by abstracting the user interface into a format accessible to large language models.
How It Works
The system functions by feeding screen captures to a chosen multimodal model (e.g., GPT-4o, Gemini Pro Vision, Claude 3). The model analyzes the visual input and generates a sequence of actions, such as mouse clicks or keystrokes, to achieve a user-defined objective. This approach leverages the advanced reasoning and visual understanding capabilities of LLMs to perform tasks that typically require human interaction.
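A minimal sketch of that loop is shown below, assuming pyautogui for screen capture and input, the openai Python client for the vision call, and a made-up single-action JSON schema; the framework's actual prompts, action format, and safeguards differ.

# Illustrative sketch only: capture the screen, ask a vision model for one
# action, execute it, and repeat. Not the project's actual implementation.
import base64
import io
import json

import pyautogui              # screen capture plus mouse/keyboard control
from openai import OpenAI     # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

def capture_screen_b64() -> str:
    """Grab the current screen and return it as a base64-encoded PNG."""
    buffer = io.BytesIO()
    pyautogui.screenshot().save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()

def next_action(objective: str, screenshot_b64: str) -> dict:
    """Ask the model for one JSON action such as {"type": "click", "x": 100, "y": 200}."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Objective: {objective}. Respond with one JSON object: "
                         "a click action with x and y, a type action with text, or a done action."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)

def operate(objective: str, max_steps: int = 10) -> None:
    """Screenshot -> model -> action loop, stopping when the model reports done."""
    for _ in range(max_steps):
        action = next_action(objective, capture_screen_b64())
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":
            pyautogui.write(action["text"])
        elif action["type"] == "done":
            break

In a real deployment the returned action would be validated and its coordinates bounds-checked before execution, since the model's output directly drives the mouse and keyboard.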
Quick Start & Requirements
pip install self-operating-computer
ollama pull llava (only needed for the experimental local LLaVA mode)
Voice mode requires PortAudio: on macOS run brew install portaudio; on Linux run sudo apt install portaudio19-dev python3-pyaudio.
The gpt-4o model requires prior OpenAI API spending of at least $5 to unlock.
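After installation, the tool is typically launched from a terminal with the operate command (for example, operate -m llava for the local LLaVA model, or operate --voice for voice mode); the exact entry point and flags here are assumptions, so check the upstream README for the current invocation.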
Highlighted Details

Maintenance & Community
Last recorded activity was 2 months ago, and the project is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats
LLaVA integration via Ollama is noted to have very high error rates and is considered experimental. The gpt-4o model requires at least $5 of prior API spending before it can be accessed.