VLM-based GUI agent for automating graphical user interfaces
CogAgent is an open-source, end-to-end Vision-Language Model (VLM) designed to act as a GUI agent. It enables automated interaction with graphical user interfaces through natural language commands and screen captures, targeting researchers and developers looking to build sophisticated automation tools.
How It Works
CogAgent is built upon the GLM-4V-9B VLM, enhanced through extensive data collection, multi-stage training, and strategic optimizations. This approach significantly improves its GUI perception, reasoning accuracy, action completeness, and task generalization. The model processes screen captures and natural language, outputting specific actions with bounding box coordinates for GUI element interaction.
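To make the last step concrete, here is a minimal sketch of how a caller might turn a predicted bounding box into a click point on screen. The output string format (`box=[[x1,y1,x2,y2]]`), the 0 to 1000 coordinate scale, and the helper name `box_center_in_pixels` are assumptions made for illustration and are not confirmed by this summary; check the model's documentation for the actual format before relying on it.

```python
import re
from typing import Optional, Tuple

# Assumed (hypothetical) action format: "CLICK(box=[[x1,y1,x2,y2]], ...)".
# Coordinates are assumed to be relative values on a 0..scale grid.
BOX_PATTERN = re.compile(r"box=\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]")


def box_center_in_pixels(
    action_text: str,
    screen_width: int,
    screen_height: int,
    scale: int = 1000,
) -> Optional[Tuple[int, int]]:
    """Return the pixel center of the first box in action_text, or None."""
    match = BOX_PATTERN.search(action_text)
    if match is None:
        return None
    x1, y1, x2, y2 = (int(group) for group in match.groups())
    # Average the two corners, then rescale from the relative grid to pixels.
    center_x = (x1 + x2) * screen_width / (2 * scale)
    center_y = (y1 + y2) * screen_height / (2 * scale)
    return round(center_x), round(center_y)
```

A caller could pass the returned pair to whatever input-automation layer it uses; returning `None` when no box is found lets the caller decide whether to retry or skip the step.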
Quick Start & Requirements
pip install -r requirements.txt
python inference/cli_demo.py --model_dir THUDM/cogagent-9b-20241220 --platform "Mac" --max_length 4096 --top_k 1 --output_image_path ./results --format_key status_action_op_sensitive
Highlighted Details
Maintenance & Community
cogagent-9b-20241220 (December 2024).
Licensing & Compatibility
Limitations & Caveats