CogAgent by zai-org

VLM-based GUI agent for automating graphical user interfaces

Created 2 years ago
1,118 stars

Top 34.2% on SourcePulse

Project Summary

CogAgent is an open-source, end-to-end Vision-Language Model (VLM) designed to act as a GUI agent. It enables automated interaction with graphical user interfaces through natural language commands and screen captures, targeting researchers and developers looking to build sophisticated automation tools.

How It Works

CogAgent is built on the GLM-4V-9B VLM, enhanced through extensive data collection, multi-stage training, and strategic optimizations. These changes improve its GUI perception, reasoning accuracy, action completeness, and task generalization. The model takes a screen capture and a natural language instruction as input and outputs a specific action, together with the bounding box coordinates of the GUI element to interact with.
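To make the last step concrete, here is a minimal sketch of how a caller might turn a predicted action into a screen position. The output format used here (an operation name followed by box=[[x1,y1,x2,y2]], with coordinates scaled to 0..999) is an assumption for illustration, as are the names GroundedAction, parse_action, and box_center_px; check the project's own documentation for the actual format.

```python
import re
from dataclasses import dataclass


@dataclass
class GroundedAction:
    """One parsed operation: its name and its bounding box."""
    op: str
    box: tuple  # (x1, y1, x2, y2) in the model's relative coordinates


# ASSUMED output shape: OP(box=[[x1,y1,x2,y2]], ...). Not confirmed by the source.
_PATTERN = re.compile(r"(\w+)\(box=\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]")


def parse_action(text: str) -> GroundedAction:
    """Extract the first grounded operation from the model's text output."""
    match = _PATTERN.search(text)
    if match is None:
        raise ValueError("no grounded operation found in model output")
    x1, y1, x2, y2 = (int(g) for g in match.groups()[1:])
    return GroundedAction(op=match.group(1), box=(x1, y1, x2, y2))


def box_center_px(box, width, height, scale=1000):
    """Map a relative box to the pixel at its center.

    `scale` is the ASSUMED range of the relative coordinates.
    """
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / scale * width
    cy = (y1 + y2) / 2 / scale * height
    return round(cx), round(cy)


if __name__ == "__main__":
    sample = "Grounded Operation: CLICK(box=[[100,200,300,400]], element_type='input')"
    action = parse_action(sample)
    print(action.op, box_center_px(action.box, width=2000, height=1000))
```

Keeping parsing separate from coordinate conversion means the same parsed action can be replayed against screenshots of different resolutions.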

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Run Inference: python inference/cli_demo.py --model_dir THUDM/cogagent-9b-20241220 --platform "Mac" --max_length 4096 --top_k 1 --output_image_path ./results --format_key status_action_op_sensitive
  • Prerequisites: Python 3.10.16+, NVIDIA GPU.
  • VRAM: Minimum 29GB for BF16 inference. INT8 requires ~15GB, INT4 ~8GB (with performance loss).
  • Fine-tuning: SFT requires ~60GB VRAM per GPU (8x A100), LoRA requires ~70GB VRAM on a single GPU.
  • Demo: HuggingFace Space, ModelScope Space
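The VRAM figures above suggest a simple rule for choosing an inference precision. The sketch below encodes only the numbers listed here; the function name pick_precision and the precision labels are hypothetical and are not part of the project's CLI.

```python
from typing import Optional

# Approximate inference VRAM needs in GB, copied from the list above.
VRAM_REQUIREMENTS_GB = {"bf16": 29, "int8": 15, "int4": 8}


def pick_precision(available_gb: float) -> Optional[str]:
    """Return the highest-precision option that fits, or None if none fits."""
    # Try the most demanding (highest-quality) option first.
    by_need = sorted(VRAM_REQUIREMENTS_GB.items(), key=lambda kv: kv[1], reverse=True)
    for name, need_gb in by_need:
        if available_gb >= need_gb:
            return name
    return None


if __name__ == "__main__":
    for vram in (80, 24, 10, 4):
        print(vram, "GB ->", pick_precision(vram))
```

Because the INT8 and INT4 figures are approximate and quantization costs some accuracy, treat the result as a starting point and leave headroom.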

Highlighted Details

  • State-of-the-art performance on GUI Agent benchmarks like Screenspot, OmniAct, and CogAgentBench-basic-cn.
  • Supports bilingual (Chinese/English) interaction.
  • Offers multiple output formats (e.g., Action-Operation, Status-Plan-Action-Operation-Sensitive).
  • Can be fine-tuned for custom tasks.

Licensing & Compatibility

  • Code: Apache 2.0 License.
  • Model Weights: Follows a separate Model License.

Limitations & Caveats

  • Limited platform support: Primarily tested on Windows and macOS; effectiveness on other systems may be suboptimal.
  • Fine-tuning requires substantial GPU memory, and SFT requires freezing the Vision Encoder.
  • The online demos do not support computer control; they only display the model's inference output.

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 19 stars in the last 30 days
