Open-Interface by AmberSahdev

LLM-powered tool to control computers via simulated input

Created 1 year ago

2,512 stars

Top 18.3% on SourcePulse

Project Summary

This project provides a self-driving interface for computers, enabling users to control their machines using Large Language Models (LLMs) like GPT-4o or Gemini. It's designed for users who want to automate complex tasks or interact with their computer through natural language commands, offering an "autopilot" experience across macOS, Linux, and Windows.

How It Works

The core approach involves sending user requests to an LLM backend, which breaks down the task into executable steps. The application then simulates keyboard and mouse inputs to perform these actions. To ensure accuracy and adapt to dynamic interfaces, it captures screenshots of the current progress and feeds them back to the LLM for course correction, creating a feedback loop for task completion.

Quick Start & Requirements

Install: Download pre-compiled binaries for macOS, Linux (Ubuntu 20.04 tested), or Windows (Windows 10 tested) from the latest release. Alternatively, clone the repository and install dependencies via pip install -r requirements.txt.
Prerequisites: Requires an OpenAI API key (with a minimum $5 pre-paid balance for GPT-4o) or a Google Gemini API key. Custom LLMs with OpenAI-compatible APIs are also supported.
Setup: Configure API keys via the application's settings.
Links: Latest Release, OpenAI API Keys, Google Gemini API Key.

Highlighted Details

Supports GPT-4o, Gemini, and custom OpenAI-compatible LLMs.
Automates tasks by simulating keyboard and mouse inputs.
Utilizes screenshots for LLM-driven feedback and course correction.
Estimated cost per LLM request: $0.0005 - $0.002.

Maintenance & Community

Active development indicated by recent releases and star count.
Project owner: AmberSahdev.

Licensing & Compatibility

License: MIT.
Compatible with commercial use and closed-source linking.

Limitations & Caveats

The system struggles with accurate spatial reasoning, making precise clicking and interaction with tabular data (like spreadsheets) difficult. It also has limitations in navigating complex GUI-rich applications that heavily rely on cursor actions. The tool currently only processes the primary display when multiple monitors are in use.

Open-Interface by AmberSahdev

Explore Similar Projects

nobodywho by nobodywho-ooo

llm.nvim by Kurama622

minimal-chat by fingerthief

reins by ibrahimcetin

clickclickclick by instavm

mcp-client-cli by adhikasp

chatgpt-md by bramses

home-llm by acon96

open-computer-use by e2b-dev

torchchat by pytorch

mcp-go by mark3labs

SillyTavern by SillyTavern