OSWorld by xlang-ai

Multimodal agent benchmark for open-ended tasks in realistic computer environments

Created 2 years ago

2,595 stars

Top 17.7% on SourcePulse

3 Experts Love This Project

Pawel Garbacki

Cofounder of Fireworks AI

Thomas Wolf

Cofounder of Hugging Face

Binyuan Hui

Research Scientist at Alibaba Qwen

Project Summary

OSWorld provides a benchmark for multimodal agents to perform open-ended tasks within real computer environments, targeting AI researchers and developers building agents that interact with graphical user interfaces. It enables the evaluation of agent capabilities in realistic desktop and web application scenarios.

How It Works

OSWorld uses virtual machines (VMware, VirtualBox, or Docker) to create isolated, reproducible environments that mimic real computer systems. Agents observe these environments through screenshots and, optionally, accessibility tree information, and act on them through simulated mouse and keyboard input (e.g., via pyautogui). This allows complex, multi-step tasks to be executed and evaluated in a controlled yet realistic setting.
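The observe-act loop described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the OSWorld API: StubDesktopEnv, Observation, scripted_agent, and the "DONE" sentinel are hypothetical names chosen for this example, the screenshot is a placeholder byte string, and actions are recorded as pyautogui-style command strings rather than executed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Observation:
    """What the agent sees at each step (placeholder values in this sketch)."""
    screenshot: bytes
    accessibility_tree: Optional[str] = None


@dataclass
class StubDesktopEnv:
    """Hypothetical stand-in for a VM-backed desktop environment."""
    max_steps: int = 5
    executed: List[str] = field(default_factory=list)

    def reset(self) -> Observation:
        # A real environment would restore a VM snapshot here.
        self.executed.clear()
        return self._observe()

    def step(self, action: str) -> Tuple[Observation, bool]:
        # A real environment would run the command inside the VM;
        # this stub only records it.
        self.executed.append(action)
        done = action == "DONE" or len(self.executed) >= self.max_steps
        return self._observe(), done

    def _observe(self) -> Observation:
        return Observation(screenshot=b"<png bytes>", accessibility_tree="<desktop/>")


def scripted_agent(obs: Observation, step_index: int) -> str:
    """Toy policy: replay a fixed plan of pyautogui-style commands, then stop."""
    plan = [
        "pyautogui.click(100, 200)",
        "pyautogui.write('hello')",
        "pyautogui.press('enter')",
    ]
    return plan[step_index] if step_index < len(plan) else "DONE"


def run_episode(env: StubDesktopEnv,
                agent: Callable[[Observation, int], str]) -> List[str]:
    """Alternate between observing and acting until the episode ends."""
    obs = env.reset()
    done = False
    step_index = 0
    while not done:
        action = agent(obs, step_index)
        obs, done = env.step(action)
        step_index += 1
    return env.executed
```

In an actual agent, scripted_agent would be replaced by a call to a multimodal model that receives the screenshot (and, if enabled, the accessibility tree) and returns the next command.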

Quick Start & Requirements

Installation: Clone the repository, create a Conda environment with Python 3.9 or later, and install the dependencies with pip install -r requirements.txt.
Prerequisites: VMware Workstation Pro (or VMware Fusion on Apple silicon) with vmrun configured, or Docker with KVM support.
Setup: The setup script automatically downloads and configures necessary virtual machines.
Documentation: Website, Paper, Doc
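Before installing, it can help to check the prerequisites listed above. The following is a minimal sketch under stated assumptions, not part of OSWorld: the function names are hypothetical, the vmrun and docker executables are assumed to be discoverable on PATH, and the presence of /dev/kvm is used as a rough, Linux-only indicator of KVM support.

```python
import os
import shutil
import sys


def preflight() -> dict:
    """Collect simple yes/no checks for the documented prerequisites."""
    return {
        "python_ok": sys.version_info >= (3, 9),
        "vmrun_on_path": shutil.which("vmrun") is not None,
        "docker_on_path": shutil.which("docker") is not None,
        # Assumption: /dev/kvm exists on Linux hosts where KVM is usable.
        "kvm_device_present": os.path.exists("/dev/kvm"),
    }


def provider_available(checks: dict) -> bool:
    """True if Python is recent enough and at least one provider looks usable."""
    vmware = checks["vmrun_on_path"]
    docker = checks["docker_on_path"] and checks["kvm_device_present"]
    return checks["python_ok"] and (vmware or docker)


if __name__ == "__main__":
    results = preflight()
    for name, ok in results.items():
        print(f"{name}: {'yes' if ok else 'no'}")
    print("ready" if provider_available(results) else "missing prerequisites")
```

A passing check only shows that the tools are present; it does not confirm that vmrun is licensed and configured or that the current user has permission to use Docker or KVM.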

Highlighted Details

Supports multiple VM providers: VMware, VirtualBox, and Docker.
Offers various observation types: screenshots, accessibility trees, etc.
Includes baseline agents for GPT-4V, Gemini Pro Vision, and Claude 3 Opus.
Provides detailed evaluation metrics and results visualization tools.

Maintenance & Community

The accompanying paper was published at NeurIPS 2024. The project is under active development, with recent updates adding Docker support and expanding the available VM providers. A Discord server is available for community engagement.

Licensing & Compatibility

The README does not explicitly state the repository's license. Commercial use or linking from closed-source software would require clarification of the licensing terms.

Limitations & Caveats

VMware support on macOS may have limitations, and KVM is generally not available on macOS hosts. Running experiments can be time-consuming and costly, especially with powerful models and extensive testing. Residual Docker containers may require manual cleanup.

Health Check

Last Commit

5 days ago

Responsiveness

1 day

Star History

95 stars in the last 30 days