VisualGLM-6B by zai-org

Multimodal dialog language model for images, Chinese, and English

created 2 years ago
4,157 stars

Top 12.0% on sourcepulse

View on GitHub
Project Summary

VisualGLM-6B is an open-source, multimodal conversational language model supporting images, Chinese, and English. It targets developers and researchers working with multimodal AI, offering a 7.8B parameter model built on ChatGLM-6B and BLIP2-Qformer for visual-language bridging. The model enables image-based Q&A and dialogue, with a focus on efficient deployment on consumer hardware.

How It Works

VisualGLM-6B integrates a 6.2B parameter language model (ChatGLM-6B) with a vision component trained using BLIP2-Qformer. This approach bridges visual and linguistic modalities by aligning visual information into the language model's semantic space. Pre-training uses 30M Chinese and 300M English image-text pairs, followed by fine-tuning on long-form visual Q&A data to align with human preferences. The model is trained using the SwissArmyTransformer (sat) library, which supports parameter-efficient fine-tuning methods like LoRA and P-tuning.

Quick Start & Requirements

  • Install: pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements.txt (or requirements_wo_ds.txt to skip deepspeed).
  • Dependencies: Python, transformers, SwissArmyTransformer>=0.4.4. GPU with CUDA is recommended for inference. INT4 quantization requires as little as 6.3GB VRAM.
  • Usage: Both Hugging Face transformers and SwissArmyTransformer (sat) inference interfaces are provided. See the Hugging Face Hub for model implementations.
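The Hugging Face path above can be sketched as follows. The model id `THUDM/visualglm-6b`, the `trust_remote_code=True` loading calls, and the `model.chat` signature follow the repository's documented usage; the `load_visualglm` helper name is our own, and the import is deferred so the heavy dependency is only pulled in when the model is actually loaded.

```python
def load_visualglm(model_id="THUDM/visualglm-6b"):
    """Sketch of the Hugging Face transformers loading path for VisualGLM-6B.

    The helper name is ours; AutoModel/AutoTokenizer with
    trust_remote_code=True follow the repository's documented usage.
    """
    from transformers import AutoModel, AutoTokenizer  # heavy deps, imported lazily
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    # fp16 weights on GPU; see the quantization options below for lower-VRAM setups
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True).half().cuda()
    return tokenizer, model.eval()

if __name__ == "__main__":
    tokenizer, model = load_visualglm()
    # model.chat takes an image path plus a text prompt
    # ("描述这张图片。" = "Describe this image.")
    response, history = model.chat(tokenizer, "example.jpg", "描述这张图片。", history=[])
    print(response)
```

Running the `__main__` block requires a CUDA GPU and downloads the checkpoint on first use.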

Highlighted Details

  • Supports both Hugging Face transformers and SwissArmyTransformer (sat) interfaces.
  • Offers parameter-efficient fine-tuning (LoRA, QLoRA, P-tuning) and model merging capabilities.
  • Provides command-line, Gradio web UI, and API deployment options.
  • Supports 4-bit and 8-bit quantization for reduced VRAM usage during inference.
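The quantized load could look like the sketch below. It assumes the `.quantize()` method that ChatGLM-family checkpoints expose through their remote code; the helper name is hypothetical, and the ~6.3GB INT4 figure is the one quoted above.

```python
def load_visualglm_quantized(bits=4, model_id="THUDM/visualglm-6b"):
    """Hypothetical helper: load VisualGLM-6B with INT4/INT8 quantization.

    Assumes the .quantize() method exposed by ChatGLM-family remote code;
    INT4 reportedly fits in ~6.3GB of VRAM.
    """
    if bits not in (4, 8):
        raise ValueError("only 4-bit and 8-bit quantization are supported")
    from transformers import AutoModel, AutoTokenizer  # heavy deps, imported lazily
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True).half()
    return tokenizer, model.quantize(bits).cuda().eval()
```

Quantization trades a small amount of response quality for a large VRAM reduction, which is what enables inference on consumer GPUs.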

Maintenance & Community

  • The project is from THUDM (whose GitHub organization has since been renamed zai-org).
  • Links to Slack and WeChat News are provided.
  • Mentions upcoming CogVLM models.

Licensing & Compatibility

  • Code is licensed under Apache-2.0.
  • Model weights are subject to a separate "Model License".
  • Users are cautioned against using the model for harmful purposes or services without safety assessment.

Limitations & Caveats

The v1 model exhibits factual inaccuracies and hallucinations in image descriptions, attribute misplacement in multi-object scenes, and insufficient detail capture due to its 224×224 input resolution. It currently lacks robust Chinese OCR capabilities.

Health Check
Last commit

11 months ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
1
Star History
21 stars in the last 90 days

Explore Similar Projects

Starred by Matei Zaharia (Cofounder of Databricks), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

LWM by LargeWorldModel

0.0%
7k
Multimodal autoregressive model for long-context video/text
created 1 year ago
updated 9 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 2 more.

ChatGLM-6B by zai-org

0.1%
41k
Bilingual dialogue language model for research
created 2 years ago
updated 1 year ago