CogVLM by zai-org

VLM for image understanding and multi-turn dialogue

created 1 year ago
6,628 stars

Top 7.8% on sourcepulse

View on GitHub
Project Summary

CogVLM and CogAgent are open-source visual language models (VLMs) designed for advanced image understanding and interaction. CogVLM excels at image captioning and visual question answering, while CogAgent extends these capabilities with GUI agent functionalities, enabling interaction with graphical user interfaces. Both models are suitable for researchers and developers working on multimodal AI applications.

How It Works

CogVLM-17B combines 10 billion visual parameters with 7 billion language parameters and supports 490×490 image inputs. CogAgent-18B builds on this architecture, raising the visual parameter count to 11 billion and supporting higher-resolution 1120×1120 inputs. This design enables detailed image comprehension and sophisticated task execution within GUI environments.

Quick Start & Requirements

  • Inference:
    • CLI (SAT version): python cli_demo_sat.py --from_pretrained <model_name> --version <version> --bf16
    • CLI (Huggingface version): python cli_demo_hf.py --from_pretrained <model_name> --bf16 (a Python inference sketch follows this list)
    • Web Demo: python web_demo.py --from_pretrained <model_name> --version <version> --bf16
  • Prerequisites: CUDA >= 11.8, Python, and spaCy with the en_core_web_sm model (python -m spacy download en_core_web_sm).
  • Hardware: For INT4 quantization, a 24GB GPU is recommended (CogAgent ~12.6GB, CogVLM ~11GB). FP16 inference requires an 80GB GPU or multiple 24GB GPUs.
  • Documentation: CogVLM & CogAgent Technical Documentation (Chinese)
  • Web Demo: CogVLM2 Demo
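
For the Hugging Face route above, the sketch below shows a minimal single-turn inference call. It follows the general pattern of the repo's HF demo, but the checkpoint ID (THUDM/cogvlm-chat-hf), the vicuna tokenizer, and the build_conversation_input_ids helper exposed via trust_remote_code are assumptions to verify against the current README.

    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, LlamaTokenizer

    # Tokenizer and checkpoint names are assumptions; check the README for current IDs.
    tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
    model = AutoModelForCausalLM.from_pretrained(
        "THUDM/cogvlm-chat-hf",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    ).to("cuda").eval()

    # build_conversation_input_ids is assumed to come from the remote modeling code.
    image = Image.open("example.jpg").convert("RGB")
    inputs = model.build_conversation_input_ids(
        tokenizer, query="Describe this image.", history=[], images=[image]
    )
    inputs = {
        "input_ids": inputs["input_ids"].unsqueeze(0).to("cuda"),
        "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to("cuda"),
        "attention_mask": inputs["attention_mask"].unsqueeze(0).to("cuda"),
        "images": [[inputs["images"][0].to("cuda").to(torch.bfloat16)]],
    }

    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=2048, do_sample=False)
        outputs = outputs[:, inputs["input_ids"].shape[1]:]
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Greedy decoding (do_sample=False) mirrors the deterministic captioning/VQA use case; for multi-turn dialogue, prior turns would be passed through the history argument instead of an empty list.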

Highlighted Details

  • State-of-the-art performance on 10+ cross-modal benchmarks.
  • CogAgent supports GUI agent tasks, including plan generation and action execution with coordinates.
  • Supports 4-bit quantization for a reduced memory footprint (see the loading sketch after this list).
  • CogVLM2, based on Llama-3-8B, claims performance on par with or exceeding GPT-4V.
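
The 4-bit path can be approximated with the standard bitsandbytes integration in transformers; this is a generic sketch under that assumption, not the project's own quantization code, and the checkpoint name should be checked against the README.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Generic bitsandbytes 4-bit config; compute dtype chosen to match the bf16 demos.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    # Checkpoint name is an assumption; the repo's own demos expose quantization via a CLI flag.
    model = AutoModelForCausalLM.from_pretrained(
        "THUDM/cogvlm-chat-hf",
        quantization_config=bnb_config,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    ).eval()

Loaded this way, CogVLM should fit within the ~11GB figure quoted under Hardware, at some cost in output quality relative to FP16/BF16 inference.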

Maintenance & Community

  • Active development with recent releases (CogVLM2, CogAgent).
  • Community support channels are not explicitly listed in the README.
  • GitHub Repository

Licensing & Compatibility

  • Code is licensed under Apache-2.0.
  • Model weights are subject to a separate "Model License" which may have restrictions.

Limitations & Caveats

  • Detailed technical documentation is primarily in Chinese.
  • Fine-tuning requires specific dataset preparation and command execution.
  • For GUI agent tasks, single-round dialogues are recommended for best results.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 140 stars in the last 90 days
