Open-Interface  by AmberSahdev

LLM-powered tool to control computers via simulated input

created 1 year ago
2,384 stars

Top 19.7% on sourcepulse

GitHubView on GitHub
Project Summary

This project provides a self-driving interface for computers, enabling users to control their machines using Large Language Models (LLMs) like GPT-4o or Gemini. It's designed for users who want to automate complex tasks or interact with their computer through natural language commands, offering an "autopilot" experience across macOS, Linux, and Windows.

How It Works

The core approach involves sending user requests to an LLM backend, which breaks down the task into executable steps. The application then simulates keyboard and mouse inputs to perform these actions. To ensure accuracy and adapt to dynamic interfaces, it captures screenshots of the current progress and feeds them back to the LLM for course correction, creating a feedback loop for task completion.

Quick Start & Requirements

  • Install: Download pre-compiled binaries for macOS, Linux (Ubuntu 20.04 tested), or Windows (Windows 10 tested) from the latest release. Alternatively, clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Requires an OpenAI API key (with a minimum $5 pre-paid balance for GPT-4o) or a Google Gemini API key. Custom LLMs with OpenAI-compatible APIs are also supported.
  • Setup: Configure API keys via the application's settings.
  • Links: Latest Release, OpenAI API Keys, Google Gemini API Key.

Highlighted Details

  • Supports GPT-4o, Gemini, and custom OpenAI-compatible LLMs.
  • Automates tasks by simulating keyboard and mouse inputs.
  • Utilizes screenshots for LLM-driven feedback and course correction.
  • Estimated cost per LLM request: $0.0005 - $0.002.

Maintenance & Community

  • Active development indicated by recent releases and star count.
  • Project owner: AmberSahdev.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

The system struggles with accurate spatial reasoning, making precise clicking and interaction with tabular data (like spreadsheets) difficult. It also has limitations in navigating complex GUI-rich applications that heavily rely on cursor actions. The tool currently only processes the primary display when multiple monitors are in use.

Health Check
Last commit

4 months ago

Responsiveness

1 week

Pull Requests (30d)
0
Issues (30d)
1
Star History
289 stars in the last 90 days

Explore Similar Projects

Feedback? Help us improve.