CogAgent by zai-org

VLM-based GUI agent for automating graphical user interfaces

Created 1 year ago
1,056 stars

Top 35.7% on SourcePulse

Project Summary

CogAgent is an open-source, end-to-end Vision-Language Model (VLM) designed to act as a GUI agent. It enables automated interaction with graphical user interfaces through natural language commands and screen captures, targeting researchers and developers looking to build sophisticated automation tools.

How It Works

CogAgent is built on the GLM-4V-9B VLM and refined through large-scale data collection, multi-stage training, and strategy optimizations. These changes target four areas: GUI perception, reasoning accuracy, action completeness, and task generalization. The model takes a screen capture and a natural-language instruction as input and outputs a specific action, with bounding-box coordinates identifying the GUI element to interact with.
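To make the "action with bounding-box coordinates" output concrete, here is a minimal Python sketch of how a caller might turn such an output into a click point. The output shape (`CLICK(box=[[x1,y1,x2,y2]], ...)`), the 0-999 normalized coordinate range, and the helper names `parse_operation` and `box_center_pixels` are assumptions made for illustration; check the model's actual response format before relying on any of them.

```python
import re

# Assumed (hypothetical) output shape: OPERATION(box=[[x1,y1,x2,y2]], ...)
_OP_PATTERN = re.compile(
    r"(?P<op>[A-Z_]+)\(box=\[\[(?P<coords>\d+,\s*\d+,\s*\d+,\s*\d+)\]\]"
)

# Assumption: coordinates are normalized to thousandths of the screenshot size.
NORMALIZED_MAX = 1000


def parse_operation(text):
    """Return (operation name, [x1, y1, x2, y2]), or None if nothing matches."""
    match = _OP_PATTERN.search(text)
    if match is None:
        return None
    box = [int(value) for value in match.group("coords").split(",")]
    return match.group("op"), box


def box_center_pixels(box, screen_width, screen_height):
    """Convert a normalized box to the pixel coordinates of its center."""
    x1, y1, x2, y2 = box
    center_x = (x1 + x2) * screen_width / (2 * NORMALIZED_MAX)
    center_y = (y1 + y2) * screen_height / (2 * NORMALIZED_MAX)
    return round(center_x), round(center_y)
```

A caller would pass the model's text response to `parse_operation`, then hand the resulting pixel point to whatever automation library performs the click.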

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Run Inference: python inference/cli_demo.py --model_dir THUDM/cogagent-9b-20241220 --platform "Mac" --max_length 4096 --top_k 1 --output_image_path ./results --format_key status_action_op_sensitive
  • Prerequisites: Python 3.10.16+, NVIDIA GPU.
  • VRAM: Minimum 29GB for BF16 inference. INT8 requires ~15GB, INT4 ~8GB (with performance loss).
  • Fine-tuning: SFT requires ~60GB VRAM per GPU across 8 A100 GPUs; LoRA requires ~70GB VRAM on a single GPU.
  • Demo: HuggingFace Space, ModelScope Space
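The VRAM figures listed for inference can be read as a simple decision rule. The sketch below encodes them in Python; the function name `choose_precision` and the precision labels are hypothetical, and the thresholds are the approximate numbers quoted in this summary, not measured values.

```python
# Approximate inference VRAM needs in GB, as quoted in the requirements list.
# Ordered from highest to lowest fidelity; INT8 and INT4 trade quality for memory.
VRAM_REQUIREMENTS_GB = [("bf16", 29), ("int8", 15), ("int4", 8)]


def choose_precision(available_vram_gb):
    """Return the highest-fidelity precision that fits, or None if none does."""
    for precision, required_gb in VRAM_REQUIREMENTS_GB:
        if available_vram_gb >= required_gb:
            return precision
    return None
```

For example, a 24GB card falls below the BF16 threshold, so the rule selects INT8 and accepts the associated performance loss.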

Highlighted Details

  • Reports state-of-the-art results on GUI agent benchmarks, including ScreenSpot, OmniACT, and CogAgentBench-basic-cn.
  • Supports bilingual (Chinese/English) interaction.
  • Offers multiple output formats (e.g., Action-Operation, Status-Plan-Action-Operation-Sensitive).
  • Can be fine-tuned for custom tasks.

Maintenance & Community

Licensing & Compatibility

  • Code: Apache 2.0 License.
  • Model Weights: Follows a separate Model License.

Limitations & Caveats

  • Limited platform support: Primarily tested on Windows and macOS; effectiveness on other systems may be suboptimal.
  • Fine-tuning requires substantial GPU memory, and SFT requires freezing the Vision Encoder.
  • The online demos cannot control a computer; they only display the model's inference output.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 22 stars in the last 30 days

Explore Similar Projects

Starred by Edward Z. Yang (Research Engineer at Meta; Maintainer of PyTorch), Anton Osika (Cofounder of Lovable), and 3 more.

gptme by gptme

0.3%
4k
CLI tool for terminal agent workflows
Created 2 years ago
Updated 21 hours ago
Starred by Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI), Elvis Saravia (Founder of DAIR.AI), and 7 more.

CogVLM by zai-org

0.0%
7k
VLM for image understanding and multi-turn dialogue
Created 2 years ago
Updated 1 year ago
Starred by Eric Zhu (Coauthor of AutoGen; Research Scientist at Microsoft Research), Yaowei Zheng (Author of LLaMA-Factory), and 2 more.

UI-TARS-desktop by bytedance

1.1%
19k
GUI agent app for computer control via natural language
Created 8 months ago
Updated 15 hours ago