Multimodal model for image and video understanding with GPT-4V-level performance
CogVLM2 is an open-source multimodal large language model series based on Llama3-8B, designed for advanced image and video understanding tasks. It offers GPT-4V-level performance, supporting both English and Chinese, with specialized versions for video analysis.
How It Works
CogVLM2 integrates a vision encoder with the Llama3-8B language model. It processes images at resolutions up to 1344x1344 and supports an 8K context length for text. The video models extract keyframes to interpret continuous visual data, handling up to one minute of video. This architecture allows for sophisticated multimodal reasoning and dialogue.
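To make the keyframe step concrete, here is an illustrative sketch of uniform keyframe sampling with OpenCV. The frame count of 24 matches the figure quoted in this summary, but the sampling strategy shown is a plain uniform stride and is not taken from the CogVLM2-Video code; treat the function name and signature as hypothetical.

```python
# Illustrative sketch: uniformly sample keyframes from a video with OpenCV.
# Not the project's actual pipeline; a stand-in to show the general idea.
import cv2
import numpy as np


def sample_keyframes(video_path: str, num_frames: int = 24) -> list[np.ndarray]:
    """Return up to `num_frames` RGB frames spaced evenly across the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        raise ValueError(f"Could not read frame count from {video_path}")

    # Evenly spaced frame indices across the whole clip.
    indices = np.linspace(0, total - 1, num=min(num_frames, total), dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame_bgr = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```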
Quick Start & Requirements
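A minimal single-image chat sketch follows, assuming the Hugging Face `trust_remote_code` loading path and the `THUDM/cogvlm2-llama3-chat-19B` checkpoint name. The `build_conversation_input_ids` helper is supplied by the model's remote code and its exact signature may differ between releases; the image path `example.jpg` is a placeholder. Consult the repository's demo scripts for the authoritative version.

```python
# Hedged quick-start sketch: single-image chat with CogVLM2 via transformers.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"  # assumed checkpoint name
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to(DEVICE).eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
query = "Describe this image."

# build_conversation_input_ids comes from the model's remote code; the call
# below follows the project's demo scripts and may change between releases.
inputs = model.build_conversation_input_ids(
    tokenizer, query=query, history=[], images=[image], template_version="chat"
)
inputs = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to(DEVICE),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to(DEVICE),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to(DEVICE),
    "images": [[inputs["images"][0].to(DEVICE).to(torch.bfloat16)]],
}

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
    output = output[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```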
Highlighted Details
Maintenance & Community
The project is actively developed by THUDM. Updates include the release of CogVLM2-Video and TGI weights. Community inference solutions are available via xinference.
Licensing & Compatibility
Released under the CogVLM2 LICENSE together with the LLAMA3_LICENSE. Users must comply with both licenses, which may have implications for commercial use or linking into closed-source products.
Limitations & Caveats
The CogVLM2-Video model represents a video with 24 extracted keyframes, so fine-grained temporal details may be missed. The Llama3 licensing terms should be reviewed carefully before commercial deployment.