OSWorld by xlang-ai

Multimodal agent benchmark for open-ended tasks in realistic computer environments

Created 1 year ago
2,149 stars

Top 21.0% on SourcePulse

Project Summary

OSWorld is a benchmark that evaluates multimodal agents on open-ended tasks in real computer environments. It targets AI researchers and developers building agents that interact with graphical user interfaces, and it measures agent capabilities in realistic desktop and web application scenarios.

How It Works

OSWorld uses virtualization (VMware, VirtualBox, or Docker) to create isolated, reproducible environments that mimic real computer systems. Agents observe these environments through screenshots and, optionally, accessibility tree information, and they act through simulated mouse and keyboard input (e.g., via pyautogui). This makes complex, multi-step tasks executable and evaluable in a controlled yet realistic setting.
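The observe-then-act cycle described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not OSWorld's actual API: `StubDesktopEnv`, `scripted_agent`, `run_episode`, and the `"DONE"` sentinel are hypothetical names, and the stub only records the pyautogui-style command strings instead of executing them inside a VM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Observation = Dict[str, object]


@dataclass
class StubDesktopEnv:
    """Hypothetical stand-in for a VM-backed desktop (not OSWorld's API)."""

    max_steps: int = 5
    executed: List[str] = field(default_factory=list)

    def _observe(self) -> Observation:
        # A real environment would return a screenshot of the VM and,
        # optionally, an accessibility tree. The stub returns placeholders.
        return {
            "screenshot": b"",
            "accessibility_tree": None,
            "step": len(self.executed),
        }

    def reset(self) -> Observation:
        self.executed.clear()
        return self._observe()

    def step(self, action: str) -> Tuple[Observation, bool]:
        # A real environment would run the command inside the VM;
        # the stub only records it.
        self.executed.append(action)
        done = action == "DONE" or len(self.executed) >= self.max_steps
        return self._observe(), done


def scripted_agent(obs: Observation) -> str:
    """Hypothetical policy: replays a fixed plan instead of calling a model."""
    plan = ["pyautogui.click(100, 200)", "pyautogui.write('hello')", "DONE"]
    index = min(int(obs["step"]), len(plan) - 1)
    return plan[index]


def run_episode(env: StubDesktopEnv, agent: Callable[[Observation], str]) -> List[str]:
    """Alternate observation and action until the environment reports done."""
    obs = env.reset()
    done = False
    while not done:
        action = agent(obs)
        obs, done = env.step(action)
    return list(env.executed)


if __name__ == "__main__":
    for command in run_episode(StubDesktopEnv(), scripted_agent):
        print(command)
```

In a real setup the scripted policy would be replaced by a multimodal model that reads the screenshot, and the step budget would bound how long a task may run.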

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (Python >= 3.9), and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Requires either VMware Workstation Pro (VMware Fusion on Apple silicon) with vmrun configured, or Docker with KVM support.
  • Setup: The setup script automatically downloads and configures necessary virtual machines.
  • Documentation: Website, Paper, Doc
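Before running the setup, it can help to confirm that the host meets the prerequisites listed above. The sketch below is a hypothetical preflight check and is not part of OSWorld: `check_prerequisites` and `usable_providers` are illustrative names, and testing for `/dev/kvm` is a Linux-only heuristic for KVM availability.

```python
import os
import shutil
import sys
from typing import Callable, Dict, List, Optional, Sequence


def check_prerequisites(
    which: Callable[[str], Optional[str]] = shutil.which,
    path_exists: Callable[[str], bool] = os.path.exists,
    version: Sequence[int] = sys.version_info,
) -> Dict[str, bool]:
    """Report which of the listed prerequisites appear to be satisfied."""
    return {
        "python>=3.9": tuple(version[:2]) >= (3, 9),
        "vmrun on PATH": which("vmrun") is not None,
        "docker on PATH": which("docker") is not None,
        # Assumption: on Linux, KVM is exposed as the /dev/kvm device node.
        "/dev/kvm present": path_exists("/dev/kvm"),
    }


def usable_providers(report: Dict[str, bool]) -> List[str]:
    """Map the report to the providers that look usable on this host."""
    providers = []
    if report["vmrun on PATH"]:
        providers.append("vmware")
    if report["docker on PATH"] and report["/dev/kvm present"]:
        providers.append("docker")
    return providers


if __name__ == "__main__":
    result = check_prerequisites()
    for name, ok in result.items():
        print(f"{name}: {'ok' if ok else 'missing'}")
    print("usable providers:", usable_providers(result) or "none")
```

The lookups are passed in as parameters so the check can be exercised without VMware or Docker installed.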

Highlighted Details

  • Supports multiple VM providers: VMware, VirtualBox, and Docker.
  • Offers various observation types: screenshots, accessibility trees, etc.
  • Includes baseline agents for GPT-4V, Gemini-ProV, and Claude-3 Opus.
  • Provides detailed evaluation metrics and results visualization tools.

Maintenance & Community

The project is associated with NeurIPS 2024 and is under active development, with recent updates adding Docker support and expanding the VM provider options. A Discord server is available for community engagement.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification on the licensing terms.

Limitations & Caveats

VMware support on macOS may have limitations, and KVM support is generally not available on macOS hosts. Running experiments can be time-consuming and incur costs, especially with powerful models and extensive testing. Residual Docker containers may require manual cleanup.

Health Check

  • Last Commit: 16 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 10
  • Issues (30d): 20
  • Star History: 71 stars in the last 30 days
