EvoCUA  by meituan

Advanced multimodal agent for evolving computer use

Created 1 month ago
268 stars

Top 95.9% on SourcePulse

GitHubView on GitHub
Project Summary

Summary

EvoCUA is a general-purpose multimodal agent designed to automate complex computer use tasks. It targets researchers and power users seeking to streamline workflows by enabling AI to interact with desktop applications like web browsers, office suites, and code editors. The primary benefit is achieving high-level task completion through natural language instructions and visual input.

How It Works

This project introduces a novel data synthesis and training methodology that enhances the computer use capabilities of existing open-source Vision-Language Models (VLMs). EvoCUA operates end-to-end, processing screenshots and natural language prompts to execute multi-turn interactions with applications such as Chrome, Excel, and VSCode, offering a significant advantage in automation efficiency.

Quick Start & Requirements

Python 3.12 is recommended. Installation involves cloning the repository, setting up a virtual environment, and installing dependencies via pip install -r requirements.txt. Model weights must be downloaded from HuggingFace. Deployment requires vLLM to serve as an OpenAI-compatible inference server. Key dependencies include torch>=2.8.0+cu126, transformers>=4.57.3, and vllm>=0.11.0, implying a need for CUDA 12.6 compatible hardware. The project provides links to model weights on HuggingFace and the OSWorld benchmark.

Highlighted Details

  • EvoCUA-32B achieved the #1 rank among open-source models on the OSWorld benchmark as of January 2026, scoring 56.7% task completion.
  • Demonstrates significant performance gains (+11.7% over OpenCUA-72B, +15.1% over Qwen3-VL thinking) with fewer parameters and reduced steps.
  • Excels at end-to-end multi-turn automation across applications like Chrome, Excel, PowerPoint, and VSCode.
  • The smaller EvoCUA-8B model also shows strong performance (46.1% on OSWorld), competitive with larger models.

Maintenance & Community

The project is developed by the Meituan LongCat Team. Specific community channels (e.g., Discord, Slack) or detailed contributor information beyond the team name are not explicitly detailed in the README.

Licensing & Compatibility

Licensed under the Apache 2.0 License, which permits commercial use and distribution. No specific compatibility restrictions for closed-source linking are mentioned.

Limitations & Caveats

While EvoCUA leads open-source models on OSWorld, human-level performance remains substantially higher, indicating ongoing research and development potential. The benchmark focuses on specific computer use tasks, and performance on broader, unrepresented tasks may vary.

Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
2
Star History
85 stars in the last 30 days

Explore Similar Projects

Feedback? Help us improve.