CogVLM by zai-org

VLM for image understanding and multi-turn dialogue

created 1 year ago
6,628 stars

Top 7.8% on sourcepulse

View on GitHub
Project Summary

CogVLM and CogAgent are open-source visual language models (VLMs) designed for advanced image understanding and interaction. CogVLM excels at image captioning and visual question answering, while CogAgent extends these capabilities with GUI agent functionalities, enabling interaction with graphical user interfaces. Both models are suitable for researchers and developers working on multimodal AI applications.

How It Works

CogVLM-17B combines 10 billion visual parameters with 7 billion language parameters and supports 490×490 image inputs. CogAgent-18B builds on this architecture, raising the visual parameter count to 11 billion and supporting higher-resolution 1120×1120 inputs. This design enables detailed image comprehension and sophisticated task execution within GUI environments.

Quick Start & Requirements

  • Inference:
    • CLI (SAT version): python cli_demo_sat.py --from_pretrained <model_name> --version <version> --bf16
    • CLI (Huggingface version): python cli_demo_hf.py --from_pretrained <model_name> --bf16 (a Python inference sketch follows this list)
    • Web Demo: python web_demo.py --from_pretrained <model_name> --version <version> --bf16
  • Prerequisites: CUDA >= 11.8, Python, and spaCy with the en_core_web_sm model (python -m spacy download en_core_web_sm).
  • Hardware: For INT4 quantization, a 24GB GPU is recommended (CogAgent ~12.6GB, CogVLM ~11GB). FP16 inference requires an 80GB GPU or multiple 24GB GPUs.
  • Documentation: CogVLM & CogAgent Technical Documentation (Chinese)
  • Web Demo: CogVLM2 Demo
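
For the Hugging Face route above, the sketch below shows a minimal single-turn inference call. It follows the general pattern of the repo's HF demo, but the checkpoint ID (THUDM/cogvlm-chat-hf), the vicuna tokenizer, and the build_conversation_input_ids helper exposed via trust_remote_code are assumptions to verify against the current README.

    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, LlamaTokenizer

    # Tokenizer and checkpoint names are assumptions; check the README for current IDs.
    tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
    model = AutoModelForCausalLM.from_pretrained(
        "THUDM/cogvlm-chat-hf",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    ).to("cuda").eval()

    # build_conversation_input_ids is assumed to come from the remote modeling code.
    image = Image.open("example.jpg").convert("RGB")
    inputs = model.build_conversation_input_ids(
        tokenizer, query="Describe this image.", history=[], images=[image]
    )
    inputs = {
        "input_ids": inputs["input_ids"].unsqueeze(0).to("cuda"),
        "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to("cuda"),
        "attention_mask": inputs["attention_mask"].unsqueeze(0).to("cuda"),
        "images": [[inputs["images"][0].to("cuda").to(torch.bfloat16)]],
    }

    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=2048, do_sample=False)
        outputs = outputs[:, inputs["input_ids"].shape[1]:]
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Greedy decoding (do_sample=False) mirrors the deterministic captioning/VQA use case; for multi-turn dialogue, prior turns would be passed through the history argument instead of an empty list.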

Highlighted Details

  • State-of-the-art performance on 10+ cross-modal benchmarks.
  • CogAgent supports GUI agent tasks, including plan generation and action execution with coordinates.
  • Supports 4-bit quantization for a reduced memory footprint (see the loading sketch after this list).
  • CogVLM2, based on Llama-3-8B, claims performance on par with or exceeding GPT-4V.
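
The 4-bit path can be approximated with the standard bitsandbytes integration in transformers; this is a generic sketch under that assumption, not the project's own quantization code, and the checkpoint name should be checked against the README.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Generic bitsandbytes 4-bit config; compute dtype chosen to match the bf16 demos.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    # Checkpoint name is an assumption; the repo's own demos expose quantization via a CLI flag.
    model = AutoModelForCausalLM.from_pretrained(
        "THUDM/cogvlm-chat-hf",
        quantization_config=bnb_config,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    ).eval()

Loaded this way, CogVLM should fit within the ~11GB figure quoted under Hardware, at some cost in output quality relative to FP16/BF16 inference.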

Maintenance & Community

  • Active development with recent releases (CogVLM2, CogAgent).
  • Community support channels are not explicitly listed in the README.
  • GitHub Repository

Licensing & Compatibility

  • Code is licensed under Apache-2.0.
  • Model weights are subject to a separate "Model License" which may have restrictions.

Limitations & Caveats

  • Detailed technical documentation is primarily in Chinese.
  • Fine-tuning requires specific dataset preparation and command execution.
  • For GUI agent tasks, single-round dialogues are recommended for best results.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 140 stars in the last 90 days
