CogVLM2 by zai-org

Multimodal model for image and video understanding, GPT4V-level performance

created 1 year ago
2,402 stars

Top 19.6% on sourcepulse

Project Summary

CogVLM2 is an open-source series of multimodal large language models built on Llama3-8B and designed for image and video understanding. The project claims GPT-4V-level performance, supports both English and Chinese, and includes specialized versions for video analysis.

How It Works

CogVLM2 integrates a vision encoder with the Llama3-8B language model. It processes images at resolutions up to 1344x1344 pixels and supports an 8K context length for text. The video models extract keyframes to interpret continuous visual data and handle up to one minute of video. Combining the two components lets the model reason over visual input and hold a dialogue about it.
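
As a rough illustration of the 1344x1344 limit, the sketch below computes a target size that keeps an image within that bound while preserving its aspect ratio. The function name `fit_within` and the shrink-only policy are assumptions made for illustration; the model's own preprocessing is not described in this summary and may resize images differently.

```python
MAX_SIDE = 1344  # maximum image side length quoted in the summary


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Return (width, height) scaled down so neither side exceeds max_side.

    Hypothetical helper: images already within the bound are left unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Never upscale: cap the scale factor at 1.0.
    scale = min(1.0, max_side / width, max_side / height)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 2688x1344 image would be reduced to 1344x672, while an 800x600 image would pass through unchanged.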

Quick Start & Requirements

  • Installation: Model weights are distributed primarily through Hugging Face and ModelScope.
  • Dependencies: Python and PyTorch. A CUDA-capable GPU is recommended.
  • Int4 Version: Requires 16GB VRAM for inference.
  • Links: Official Page, Hugging Face, CogVLM2-Video
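
The requirements above could be checked and applied roughly as follows. This is a minimal sketch, not the project's documented procedure: the helper names are hypothetical, the repository id `THUDM/cogvlm2-llama3-chat-19B-int4` is an assumption that should be verified on Hugging Face, and the Int4 weights will likely need extra packages (for example a quantization backend) that this summary does not list. Heavy imports are deferred so the VRAM check can run without PyTorch installed.

```python
INT4_MIN_VRAM_GB = 16  # Int4 figure quoted in the summary above


def fits_int4(total_vram_bytes: int, required_gb: float = INT4_MIN_VRAM_GB) -> bool:
    """Return True if the reported VRAM meets the quoted Int4 requirement.

    Compares in decimal gigabytes (10**9 bytes); this unit is an assumption.
    """
    return total_vram_bytes >= required_gb * 10**9


def detect_vram_bytes() -> int:
    """Total memory of CUDA device 0, or 0 if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.get_device_properties(0).total_memory


def load_int4(model_id: str = "THUDM/cogvlm2-llama3-chat-19B-int4"):
    """Load tokenizer and model from Hugging Face (model_id is an assumption)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # trust_remote_code is needed because the model ships custom modeling code.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
    )
    return tokenizer, model.eval()
```

A caller might guard the download with `if fits_int4(detect_vram_bytes()): tokenizer, model = load_int4()`, falling back to a hosted endpoint otherwise.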

Highlighted Details

  • Reports state-of-the-art results on benchmarks such as TextVQA, DocVQA, and MVBench.
  • Supports 8K context length and high-resolution images (1344x1344).
  • Offers an Int4 quantized version for reduced VRAM usage (16GB).
  • Includes CogVLM2-Video for up to 1-minute video understanding.

Maintenance & Community

The project is developed by THUDM and hosted under the zai-org organization on GitHub. Updates include the release of CogVLM2-Video and TGI weights. Community inference solutions are available via xinference.

Licensing & Compatibility

Released under the CogVLM2 LICENSE and the LLAMA3_LICENSE. Users must comply with both, and either may restrict commercial use or closed-source linking.

Limitations & Caveats

The CogVLM2-Video model processes each video by extracting 24 keyframes, so it may miss events that fall between the sampled frames. The Llama3 license terms should be reviewed carefully before any commercial use.
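
To make the temporal limitation concrete, the sketch below picks 24 evenly spaced frame indices from a clip. Uniform sampling is used here only as a simple stand-in; the summary does not say how CogVLM2-Video actually selects its keyframes, and the function name `sample_frame_indices` is hypothetical.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 24) -> list[int]:
    """Return up to num_samples evenly spaced frame indices.

    Illustrative uniform sampling, not the model's documented strategy.
    Each index is the midpoint of one of num_samples equal segments.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if total_frames <= num_samples:
        # Short clip: keep every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# A 60-second clip at 24 fps has 1440 frames.
indices = sample_frame_indices(60 * 24)
```

Under these assumptions, a one-minute clip yields one sampled frame every 2.5 seconds (60 s / 24), so any event shorter than that interval can fall entirely between samples.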

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

77 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

CogVideo by zai-org

Text-to-video generation models (CogVideoX, CogVideo)

created 3 years ago, updated 1 month ago
12k stars

Top 0.4% on sourcepulse