meituan/EvoCUA: Advanced multimodal agent for evolving computer use
Top 95.9% on SourcePulse
Summary
EvoCUA is a general-purpose multimodal agent designed to automate complex computer use tasks. It targets researchers and power users seeking to streamline workflows by enabling AI to interact with desktop applications like web browsers, office suites, and code editors. The primary benefit is achieving high-level task completion through natural language instructions and visual input.
How It Works
This project introduces a novel data synthesis and training methodology that enhances the computer use capabilities of existing open-source Vision-Language Models (VLMs). EvoCUA operates end-to-end, processing screenshots and natural language prompts to execute multi-turn interactions with applications such as Chrome, Excel, and VSCode, automating tasks that would otherwise require manual GUI operation.
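To make the end-to-end loop concrete, the sketch below shows how a screenshot plus a natural-language instruction might be packed into an OpenAI-style chat-completions payload for a vLLM server. The model name, prompt, and image bytes are illustrative placeholders, not values documented by EvoCUA:

```python
import base64
import json

def build_request(instruction: str, screenshot_png: bytes, model: str = "evocua"):
    """Pack one turn of a computer-use interaction: a text instruction
    plus a base64-encoded screenshot, in OpenAI chat-completions format."""
    image_b64 = base64.b64encode(screenshot_png).decode("ascii")
    return {
        "model": model,  # placeholder model id, not a published name
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": instruction},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    }

if __name__ == "__main__":
    # In a real session this payload would be POSTed to the vLLM
    # /v1/chat/completions endpoint once per turn.
    payload = build_request("Open Chrome and search for vLLM", b"\x89PNG...")
    print(json.dumps(payload)[:60])
```

In a multi-turn run, each new screenshot is sent as a fresh user turn while prior model actions stay in the message history, which is what lets the agent react to the evolving screen state.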
Quick Start & Requirements
Python 3.12 is recommended. Installation involves cloning the repository, setting up a virtual environment, and installing dependencies via pip install -r requirements.txt. Model weights must be downloaded from HuggingFace. Deployment requires vLLM to serve as an OpenAI-compatible inference server. Key dependencies include torch>=2.8.0+cu126, transformers>=4.57.3, and vllm>=0.11.0, implying a need for CUDA 12.6-compatible hardware. The project provides links to the model weights on HuggingFace and to the OSWorld benchmark.
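The steps above might look roughly as follows; the repository URL and weights path are placeholders inferred from the description, not verified paths:

```shell
# Clone and install (Python 3.12 recommended)
git clone https://github.com/meituan/EvoCUA.git   # placeholder URL
cd EvoCUA
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Serve the HuggingFace weights as an OpenAI-compatible endpoint via vLLM
vllm serve /path/to/downloaded/weights --port 8000
```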
Maintenance & Community
The project is developed by the Meituan LongCat Team. Specific community channels (e.g., Discord, Slack) or detailed contributor information beyond the team name are not explicitly detailed in the README.
Licensing & Compatibility
Licensed under the Apache 2.0 License, which permits commercial use and distribution. No specific compatibility restrictions for closed-source linking are mentioned.
Limitations & Caveats
While EvoCUA leads open-source models on OSWorld, human-level performance remains substantially higher, leaving considerable room for improvement. The benchmark covers a specific set of computer use tasks, so performance on broader, unrepresented tasks may vary.