CogAgent by zai-org

VLM-based GUI agent for automating graphical user interfaces

created 1 year ago
1,009 stars

Top 37.7% on sourcepulse

Project Summary

CogAgent is an open-source, end-to-end Vision-Language Model (VLM) designed to act as a GUI agent. It enables automated interaction with graphical user interfaces through natural language commands and screen captures, targeting researchers and developers looking to build sophisticated automation tools.

How It Works

CogAgent is built upon the GLM-4V-9B VLM, enhanced through extensive data collection, multi-stage training, and strategic optimizations. This approach significantly improves its GUI perception, reasoning accuracy, action completeness, and task generalization. The model processes screen captures and natural language, outputting specific actions with bounding box coordinates for GUI element interaction.
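Since the model emits actions with bounding box coordinates as text, a client has to parse that output before driving the GUI. The sketch below is illustrative only: the `ACTION(box=[[x1,y1,x2,y2]])` string format and the 0-999 normalized coordinate grid are assumptions, not the model's documented schema, which depends on the chosen output format.

```python
import re

def parse_action(model_output: str, screen_w: int, screen_h: int):
    """Parse a hypothetical 'ACTION(box=[[x1,y1,x2,y2]])' string into
    pixel coordinates. This schema is an assumption for illustration;
    the real output depends on the configured format key."""
    m = re.search(r"(\w+)\(box=\[\[(\d+),(\d+),(\d+),(\d+)\]\]\)", model_output)
    if not m:
        return None
    action = m.group(1)
    # Assume coordinates lie on a 0-999 normalized grid (a common VLM
    # convention) and scale them to the actual screen resolution.
    x1, y1, x2, y2 = (int(v) for v in m.groups()[1:])
    to_px = lambda v, size: round(v / 999 * size)
    return {
        "action": action,
        "box": (to_px(x1, screen_w), to_px(y1, screen_h),
                to_px(x2, screen_w), to_px(y2, screen_h)),
        "center": (to_px((x1 + x2) // 2, screen_w),
                   to_px((y1 + y2) // 2, screen_h)),
    }

print(parse_action("CLICK(box=[[100,200,300,400]])", 1920, 1080))
```

The center point is what a mouse-automation layer would typically click; consult the repository's inference code for the authoritative parsing logic.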

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Run Inference: python inference/cli_demo.py --model_dir THUDM/cogagent-9b-20241220 --platform "Mac" --max_length 4096 --top_k 1 --output_image_path ./results --format_key status_action_op_sensitive
  • Prerequisites: Python 3.10.16+, NVIDIA GPU.
  • VRAM: Minimum 29GB for BF16 inference. INT8 requires ~15GB, INT4 ~8GB (with performance loss).
  • Fine-tuning: SFT requires ~60GB VRAM per GPU (8x A100), LoRA requires ~70GB VRAM on a single GPU.
  • Demo: HuggingFace Space, ModelScope Space
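For scripted runs, the Quick Start command can be assembled programmatically. This is a minimal sketch that mirrors the flag names shown above; verify them against `inference/cli_demo.py --help` before relying on it, since defaults and flags may change between releases.

```python
import shlex

def build_infer_cmd(model_dir: str, platform: str, out_dir: str,
                    format_key: str = "status_action_op_sensitive",
                    max_length: int = 4096, top_k: int = 1) -> list[str]:
    """Assemble the CLI inference command from the Quick Start example.
    Flag names mirror the documented invocation and are not guaranteed
    to match every version of the script."""
    return [
        "python", "inference/cli_demo.py",
        "--model_dir", model_dir,
        "--platform", platform,
        "--max_length", str(max_length),
        "--top_k", str(top_k),
        "--output_image_path", out_dir,
        "--format_key", format_key,
    ]

cmd = build_infer_cmd("THUDM/cogagent-9b-20241220", "Mac", "./results")
print(shlex.join(cmd))  # shell-quoted command line, ready for copy-paste
```

Passing the argument list directly to `subprocess.run(cmd)` avoids shell-quoting pitfalls with paths that contain spaces.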

Highlighted Details

  • State-of-the-art performance on GUI agent benchmarks such as ScreenSpot, OmniAct, and CogAgentBench-basic-cn.
  • Supports bilingual (Chinese/English) interaction.
  • Offers multiple output formats (e.g., Action-Operation, Status-Plan-Action-Operation-Sensitive).
  • Can be fine-tuned for custom tasks.

Licensing & Compatibility

  • Code: Apache 2.0 License.
  • Model Weights: Follows a separate Model License.

Limitations & Caveats

  • Limited platform support: Primarily tested on Windows and macOS; effectiveness on other systems may be suboptimal.
  • Fine-tuning requires substantial GPU memory, and SFT requires freezing the Vision Encoder.
  • Online demos do not support computer control; they are limited to viewing model inference output.
Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 89 stars in the last 90 days

Explore Similar Projects

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

CogVLM by zai-org

  • Top 0.1%
  • 7k stars
  • VLM for image understanding and multi-turn dialogue
  • created 1 year ago, updated 1 year ago