Multimodal agent benchmark for open-ended tasks in realistic computer environments
OSWorld is a benchmark that evaluates multimodal agents on open-ended tasks within real computer environments. It targets AI researchers and developers building agents that interact with graphical user interfaces, and it measures agent capabilities in realistic desktop and web application scenarios.
How It Works
OSWorld leverages virtual machine technology (VMware, VirtualBox, Docker) to create isolated, reproducible environments that mimic real computer systems. Agents interact with these environments using a combination of visual observations (screenshots) and potentially accessibility tree information, executing actions via simulated mouse and keyboard inputs (e.g., using pyautogui). This approach allows for complex, multi-step task execution and evaluation in a controlled yet realistic setting.
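The observe-then-act loop described above can be sketched roughly as follows. This is a minimal illustration, not OSWorld's actual API: StubDesktopEnv, scripted_agent, and run_episode are hypothetical names, the screenshot is a placeholder byte string, and the actions are plain strings containing pyautogui calls that the stub only records rather than executes.

```python
from dataclasses import dataclass, field


@dataclass
class StubDesktopEnv:
    """Hypothetical stand-in for a VM-backed environment (not OSWorld's API)."""
    max_steps: int = 5
    executed: list = field(default_factory=list)

    def _observe(self):
        # A real environment would capture a screenshot from the VM and,
        # optionally, an accessibility tree. Both are placeholders here.
        return {"screenshot": b"<png bytes>", "accessibility_tree": None}

    def reset(self):
        self.executed.clear()
        return self._observe()

    def step(self, action: str):
        # A real environment would run the command inside the VM;
        # this stub only records it.
        self.executed.append(action)
        done = len(self.executed) >= self.max_steps
        return self._observe(), done


def scripted_agent(observation, step_index):
    """Toy policy: ignores the observation and replays fixed pyautogui commands."""
    plan = [
        "pyautogui.click(100, 200)",
        "pyautogui.write('hello')",
        "pyautogui.hotkey('ctrl', 's')",
    ]
    return plan[step_index] if step_index < len(plan) else None


def run_episode(env, agent):
    """Alternate observation and action until the agent stops or the env is done."""
    observation = env.reset()
    step_index = 0
    while True:
        action = agent(observation, step_index)
        if action is None:
            break
        observation, done = env.step(action)
        step_index += 1
        if done:
            break
    return list(env.executed)


if __name__ == "__main__":
    print(run_episode(StubDesktopEnv(), scripted_agent))
```

A real agent would replace scripted_agent with a model call that reads the screenshot and decides the next command; the loop structure stays the same.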
Quick Start & Requirements
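Install the Python dependencies with pip install -r requirements.txt.
A VM provider is also required: VMware (with the vmrun command available), or Docker with KVM support.
Highlighted Details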
Maintenance & Community
The project is associated with NeurIPS 2024 and has active development with recent updates supporting Docker and expanding VM provider options. A Discord server is available for community engagement.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification on the licensing terms.
Limitations & Caveats
VMware support on macOS may have limitations, and KVM support is generally not available on macOS hosts. Running experiments can be time-consuming and incur costs, especially with powerful models and extensive testing. Residual Docker containers may require manual cleanup.
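Because provider availability depends on the host, a quick pre-flight check can help before launching experiments. The sketch below is an assumption-laden illustration rather than anything shipped with the project: check_host is a hypothetical helper, and /dev/kvm is the usual KVM device path on Linux, which may differ or be absent on other systems.

```python
import os
import platform
import shutil


def check_host(kvm_device="/dev/kvm"):
    """Report which VM provider prerequisites appear to be present on this host.

    Hypothetical helper: it only checks that executables are on PATH and that
    the KVM device node exists; it does not verify that they actually work.
    """
    return {
        "system": platform.system(),
        "vmrun_on_path": shutil.which("vmrun") is not None,
        "docker_on_path": shutil.which("docker") is not None,
        "kvm_device_present": os.path.exists(kvm_device),
    }


if __name__ == "__main__":
    for name, value in check_host().items():
        print(f"{name}: {value}")
```

On a macOS host, kvm_device_present would be expected to come back False, which matches the caveat above.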