OpenCUA by xlang-ai

Framework for computer-use agents

Created 2 months ago
475 stars

Top 64.4% on SourcePulse

Project Summary

OpenCUA provides a comprehensive open-source framework for scaling computer-use agent (CUA) data and foundation models. It addresses the need for robust, generalizable agents capable of performing complex tasks across various applications and operating systems. The framework is targeted at researchers and developers in AI, particularly those working on embodied AI, robotics, and intelligent automation, offering a significant advancement in open-source CUA capabilities.

How It Works

OpenCUA comprises AgentNet, a large-scale dataset of human computer-use demonstrations; AgentNetTool, an annotation infrastructure for capturing these demonstrations; AgentNetBench, an offline evaluator for benchmarking agent actions; and OpenCUA Models, end-to-end foundation models trained on the AgentNet dataset. The core innovation lies in the scale and diversity of the AgentNet dataset, coupled with the framework's ability to process raw demonstrations into concise state-action pairs and synthesize reflective Chain-of-Thought (CoT) reasoning, which enhances model robustness and generalization.
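For intuition, the sketch below shows one way such reduced demonstrations could be represented as state-action pairs with an attached reflective reasoning string; the field names are illustrative assumptions, not OpenCUA's actual schema.

```python
# Illustrative sketch only: field names are assumptions, not OpenCUA's actual data schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class StateAction:
    screenshot: str        # path to the screen capture observed at this step
    action: str            # concise action, e.g. a pyautogui-style command string
    reflection: str = ""   # synthesized reflective Chain-of-Thought for this step

@dataclass
class Trajectory:
    task: str                                        # natural-language task description
    steps: List[StateAction] = field(default_factory=list)
```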

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies using pip install -r requirement.txt within a conda environment (conda create -n opencua python=3.10).
  • Model Download: Use huggingface_hub to download model weights (e.g., xlangai/OpenCUA-7B); a minimal download sketch follows this list.
  • Prerequisites: Python 3.10, conda, and huggingface_hub. Certain model versions may require alignment with Kimi-VL's tokenizer and chat template.
  • Running Examples: Execute grounding examples via python huggingface_inference.py in the ./model/inference/ directory. Run agents in the OSWorld environment using provided commands (e.g., python run_multienv_opencua.py ...).
  • Links: AgentNet Huggingface Dataset, AgentNetTool Document
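As a concrete sketch of the download step, the snippet below uses huggingface_hub's snapshot_download with the 7B checkpoint named above; the local_dir path is an arbitrary example.

```python
# Download the OpenCUA-7B weights from the Hugging Face Hub.
# local_dir is an example path; any writable directory works.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="xlangai/OpenCUA-7B",
    local_dir="./OpenCUA-7B",
)
print(f"Weights downloaded to {local_dir}")
```

Grounding inference and OSWorld agent runs are then launched with the repository scripts listed above.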

Highlighted Details

  • OpenCUA-32B achieves a state-of-the-art 34.8% success rate on OSWorld-Verified among open-source models.
  • The AgentNet dataset contains 22.6K human-annotated computer-use tasks across Windows, macOS, and Ubuntu.
  • AgentNetTool supports synchronized capture of screen video, mouse/keyboard events, and accessibility trees.
  • The framework includes a DataProcessor for action reduction and state-action matching, and a CoTGenerator for synthesizing reflective reasoning (action reduction is sketched conceptually below).
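The sketch below is a toy illustration of what action reduction means conceptually, collapsing redundant low-level input events into a concise action sequence; it is not the DataProcessor's actual implementation.

```python
# Toy illustration of action reduction: merge consecutive mouse-move events into a
# single move to the final position. Not OpenCUA's actual DataProcessor logic.
from typing import List, Tuple

Event = Tuple[str, int, int]  # (event_type, x, y)

def reduce_mouse_moves(events: List[Event]) -> List[Event]:
    reduced: List[Event] = []
    for event in events:
        if reduced and event[0] == "move" and reduced[-1][0] == "move":
            reduced[-1] = event  # keep only the final position of a run of moves
        else:
            reduced.append(event)
    return reduced

raw = [("move", 10, 10), ("move", 12, 11), ("move", 40, 80), ("click", 40, 80)]
print(reduce_mouse_moves(raw))  # [('move', 40, 80), ('click', 40, 80)]
```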

Maintenance & Community

The project acknowledges contributions from various individuals and teams, including Moonshot AI and the Kimi Team, and is built upon DuckTrack and OpenAdapt. Further details on community channels or roadmap are not explicitly provided in the README.

Licensing & Compatibility

The project is intended for research and educational purposes only. Prohibited uses include any activity violating applicable laws or regulations, and illegal, unethical, or harmful activities. The authors disclaim responsibility for any misuse.

Limitations & Caveats

vLLM support is still in progress; for now, users are advised to run inference with the standard transformers library (see the sketch below). The training code is also under development; the released models were trained on the Kimi Team's infrastructure.
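A minimal loading sketch with transformers is shown below, assuming the released checkpoints load through trust_remote_code; the exact model class, preprocessing, and prompt format are defined in the repository's huggingface_inference.py, which should be treated as authoritative.

```python
# Assumption: the checkpoint ships custom modeling code that loads via trust_remote_code.
# See ./model/inference/huggingface_inference.py for the exact invocation.
from transformers import AutoModel, AutoTokenizer

model_id = "xlangai/OpenCUA-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",       # requires the accelerate package
    trust_remote_code=True,
)
```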

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 17

Star History

227 stars in the last 30 days
