Multimodal dialog language model for images, Chinese, and English
Top 12.0% on sourcepulse
VisualGLM-6B is an open-source, multimodal conversational language model supporting images, Chinese, and English. It targets developers and researchers working with multimodal AI, offering a 7.8B parameter model built on ChatGLM-6B and BLIP2-Qformer for visual-language bridging. The model enables image-based Q&A and dialogue, with a focus on efficient deployment on consumer hardware.
How It Works
VisualGLM-6B integrates the 6.2B-parameter ChatGLM-6B language model with a vision encoder whose output is bridged into the language model by a BLIP2-style Q-Former, aligning visual information with the language model's semantic space. Pre-training uses 30M Chinese and 300M English image-text pairs, followed by fine-tuning on long-form visual Q&A data to align with human preferences. The model is trained with the SwissArmyTransformer (sat) library, which supports parameter-efficient fine-tuning methods such as LoRA and P-tuning.
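To illustrate the bridging idea, below is a deliberately simplified, conceptual sketch (not the actual VisualGLM-6B code): a small set of learned query vectors cross-attends over image patch features and is projected into the language model's hidden size, producing a short sequence of "visual tokens" that can be prepended to the text embeddings. The dimensions (1408 for the vision encoder, 4096 for ChatGLM-6B, 32 queries) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Toy Q-Former-style bridge: learned queries cross-attend over patch features."""
    def __init__(self, vision_dim=1408, lm_dim=4096, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim))
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vision_dim, lm_dim)  # project into the LM's embedding space

    def forward(self, patch_features):  # patch_features: (batch, num_patches, vision_dim)
        q = self.queries.unsqueeze(0).expand(patch_features.size(0), -1, -1)
        fused, _ = self.cross_attn(q, patch_features, patch_features)  # queries attend to patches
        return self.proj(fused)  # (batch, num_queries, lm_dim) "visual tokens"

# A 224x224 image yields a fixed-length patch sequence from the vision encoder;
# the bridge compresses it into 32 tokens in the LM's semantic space.
patches = torch.randn(1, 257, 1408)       # illustrative patch features
visual_tokens = QFormerBridge()(patches)
print(visual_tokens.shape)                # torch.Size([1, 32, 4096])
```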
Quick Start & Requirements
Install dependencies:
pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements.txt
(or requirements_wo_ds.txt to skip deepspeed). Key requirements include transformers and SwissArmyTransformer>=0.4.4. A GPU with CUDA is recommended for inference; with INT4 quantization the model requires as little as 6.3GB of VRAM. Both transformers and SwissArmyTransformer (sat) interfaces are provided; see the Hugging Face Hub for model implementations.
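As a minimal sketch of the transformers interface (the THUDM/visualglm-6b checkpoint name and the model.chat(tokenizer, image_path, query, history=...) signature are taken from the upstream model card and should be verified against the current release):

```python
from transformers import AutoModel, AutoTokenizer

# Load the checkpoint with remote code enabled (the model class ships with the weights).
tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()
model.eval()

image_path = "example.jpg"  # path to a local image
# Ask a question about the image; Chinese prompts work as well.
response, history = model.chat(tokenizer, image_path, "Describe this image.", history=[])
print(response)
# Follow-up turns reuse the returned history for multi-round dialogue.
response, history = model.chat(tokenizer, image_path, "Where might it have been taken?", history=history)
print(response)
```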
Highlighted Details
Offers both transformers and SwissArmyTransformer (sat) interfaces; an INT4 loading sketch via the transformers path follows below.
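A hedged sketch of quantized loading (the .quantize(4) call mirrors the ChatGLM-style quantization API and is assumed here rather than confirmed by this summary):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = (
    AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
    .quantize(4)   # INT4 weights; reportedly fits in roughly 6.3GB of VRAM
    .half()
    .cuda()
    .eval()
)
```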
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The v1 model exhibits limitations including factual inaccuracies/hallucinations in image descriptions, attribute misplacement in multi-object scenes, and insufficient detail capture due to a 224x224 input resolution. It currently lacks robust Chinese OCR capabilities.