CogAgent by zai-org

VLM-based GUI agent for automating graphical user interfaces

created 1 year ago
1,009 stars

Top 37.7% on sourcepulse

Project Summary

CogAgent is an open-source, end-to-end Vision-Language Model (VLM) designed to act as a GUI agent. It enables automated interaction with graphical user interfaces through natural language commands and screen captures, targeting researchers and developers looking to build sophisticated automation tools.

How It Works

CogAgent is built upon the GLM-4V-9B VLM, enhanced through extensive data collection, multi-stage training, and strategic optimizations. This approach significantly improves its GUI perception, reasoning accuracy, action completeness, and task generalization. The model processes screen captures and natural language, outputting specific actions with bounding box coordinates for GUI element interaction.
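Since the model emits actions with bounding box coordinates as text, a client has to parse that output before driving the GUI. The sketch below is illustrative only: the `ACTION(box=[[x1,y1,x2,y2]])` string format and the 0-999 normalized coordinate grid are assumptions, not the model's documented schema, which depends on the chosen output format.

```python
import re

def parse_action(model_output: str, screen_w: int, screen_h: int):
    """Parse a hypothetical 'ACTION(box=[[x1,y1,x2,y2]])' string into
    pixel coordinates. This schema is an assumption for illustration;
    the real output depends on the configured format key."""
    m = re.search(r"(\w+)\(box=\[\[(\d+),(\d+),(\d+),(\d+)\]\]\)", model_output)
    if not m:
        return None
    action = m.group(1)
    # Assume coordinates lie on a 0-999 normalized grid (a common VLM
    # convention) and scale them to the actual screen resolution.
    x1, y1, x2, y2 = (int(v) for v in m.groups()[1:])
    to_px = lambda v, size: round(v / 999 * size)
    return {
        "action": action,
        "box": (to_px(x1, screen_w), to_px(y1, screen_h),
                to_px(x2, screen_w), to_px(y2, screen_h)),
        "center": (to_px((x1 + x2) // 2, screen_w),
                   to_px((y1 + y2) // 2, screen_h)),
    }

print(parse_action("CLICK(box=[[100,200,300,400]])", 1920, 1080))
```

The center point is what a mouse-automation layer would typically click; consult the repository's inference code for the authoritative parsing logic.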

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Run Inference: python inference/cli_demo.py --model_dir THUDM/cogagent-9b-20241220 --platform "Mac" --max_length 4096 --top_k 1 --output_image_path ./results --format_key status_action_op_sensitive
  • Prerequisites: Python 3.10.16+, NVIDIA GPU.
  • VRAM: Minimum 29GB for BF16 inference. INT8 requires ~15GB, INT4 ~8GB (with performance loss).
  • Fine-tuning: SFT requires ~60GB VRAM per GPU (8x A100), LoRA requires ~70GB VRAM on a single GPU.
  • Demo: HuggingFace Space, ModelScope Space
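For scripted runs, the Quick Start command can be assembled programmatically. This is a minimal sketch that mirrors the flag names shown above; verify them against `inference/cli_demo.py --help` before relying on it, since defaults and flags may change between releases.

```python
import shlex

def build_infer_cmd(model_dir: str, platform: str, out_dir: str,
                    format_key: str = "status_action_op_sensitive",
                    max_length: int = 4096, top_k: int = 1) -> list[str]:
    """Assemble the CLI inference command from the Quick Start example.
    Flag names mirror the documented invocation and are not guaranteed
    to match every version of the script."""
    return [
        "python", "inference/cli_demo.py",
        "--model_dir", model_dir,
        "--platform", platform,
        "--max_length", str(max_length),
        "--top_k", str(top_k),
        "--output_image_path", out_dir,
        "--format_key", format_key,
    ]

cmd = build_infer_cmd("THUDM/cogagent-9b-20241220", "Mac", "./results")
print(shlex.join(cmd))  # shell-quoted command line, ready for copy-paste
```

Passing the argument list directly to `subprocess.run(cmd)` avoids shell-quoting pitfalls with paths that contain spaces.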

Highlighted Details

  • State-of-the-art performance on GUI agent benchmarks such as ScreenSpot, OmniAct, and CogAgentBench-basic-cn.
  • Supports bilingual (Chinese/English) interaction.
  • Offers multiple output formats (e.g., Action-Operation, Status-Plan-Action-Operation-Sensitive).
  • Can be fine-tuned for custom tasks.

Licensing & Compatibility

  • Code: Apache 2.0 License.
  • Model Weights: Follows a separate Model License.

Limitations & Caveats

  • Limited platform support: Primarily tested on Windows and macOS; effectiveness on other systems may be suboptimal.
  • Fine-tuning requires substantial GPU memory, and SFT requires freezing the Vision Encoder.
  • Online demos do not support computer control; they are limited to viewing model inference output.
Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 89 stars in the last 90 days

Explore Similar Projects

Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

CogVLM by zai-org

  • Top 0.1%
  • 7k stars
  • VLM for image understanding and multi-turn dialogue
  • created 1 year ago, updated 1 year ago