meituan/EvoCUA: Advanced multimodal agent for evolving computer use
Top 95.9% on SourcePulse
Summary
EvoCUA is a general-purpose multimodal agent designed to automate complex computer use tasks. It targets researchers and power users seeking to streamline workflows by enabling AI to interact with desktop applications like web browsers, office suites, and code editors. The primary benefit is achieving high-level task completion through natural language instructions and visual input.
How It Works
This project introduces a novel data synthesis and training methodology that enhances the computer use capabilities of existing open-source Vision-Language Models (VLMs). EvoCUA operates end-to-end, processing screenshots and natural language prompts to execute multi-turn interactions with applications such as Chrome, Excel, and VSCode, automating tasks that would otherwise require manual GUI operation.
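To make the end-to-end loop concrete, the sketch below shows how a screenshot plus a natural-language instruction might be packed into an OpenAI-style chat-completions payload for a vLLM server. The model name, prompt, and image bytes are illustrative placeholders, not values documented by EvoCUA:

```python
import base64
import json

def build_request(instruction: str, screenshot_png: bytes, model: str = "evocua"):
    """Pack one turn of a computer-use interaction: a text instruction
    plus a base64-encoded screenshot, in OpenAI chat-completions format."""
    image_b64 = base64.b64encode(screenshot_png).decode("ascii")
    return {
        "model": model,  # placeholder model id, not a published name
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": instruction},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    }

if __name__ == "__main__":
    # In a real session this payload would be POSTed to the vLLM
    # /v1/chat/completions endpoint once per turn.
    payload = build_request("Open Chrome and search for vLLM", b"\x89PNG...")
    print(json.dumps(payload)[:60])
```

In a multi-turn run, each new screenshot is sent as a fresh user turn while prior model actions stay in the message history, which is what lets the agent react to the evolving screen state.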
Quick Start & Requirements
Python 3.12 is recommended. Installation involves cloning the repository, setting up a virtual environment, and installing dependencies via pip install -r requirements.txt. Model weights must be downloaded from HuggingFace. Deployment requires vLLM to serve as an OpenAI-compatible inference server. Key dependencies include torch>=2.8.0+cu126, transformers>=4.57.3, and vllm>=0.11.0, implying a need for CUDA 12.6-compatible hardware. The project provides links to the model weights on HuggingFace and to the OSWorld benchmark.
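The steps above might look roughly as follows; the repository URL and weights path are placeholders inferred from the description, not verified paths:

```shell
# Clone and install (Python 3.12 recommended)
git clone https://github.com/meituan/EvoCUA.git   # placeholder URL
cd EvoCUA
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Serve the HuggingFace weights as an OpenAI-compatible endpoint via vLLM
vllm serve /path/to/downloaded/weights --port 8000
```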
Maintenance & Community
The project is developed by the Meituan LongCat Team. Specific community channels (e.g., Discord, Slack) or detailed contributor information beyond the team name are not explicitly detailed in the README.
Licensing & Compatibility
Licensed under the Apache 2.0 License, which permits commercial use and distribution. No specific compatibility restrictions for closed-source linking are mentioned.
Limitations & Caveats
While EvoCUA leads open-source models on OSWorld, human-level performance remains substantially higher, leaving considerable room for improvement. The benchmark covers a specific set of computer use tasks, so performance on broader, unrepresented tasks may vary.