Multimodal dialog language model for images, Chinese, and English
Top 12.0% on sourcepulse
VisualGLM-6B is an open-source, multimodal conversational language model supporting images, Chinese, and English. It targets developers and researchers working with multimodal AI, offering a 7.8B parameter model built on ChatGLM-6B and BLIP2-Qformer for visual-language bridging. The model enables image-based Q&A and dialogue, with a focus on efficient deployment on consumer hardware.
How It Works
VisualGLM-6B integrates the 6.2B-parameter ChatGLM-6B language model with a vision encoder whose output is bridged into the language model by a BLIP2-style Q-Former, aligning visual information with the language model's semantic space. Pre-training uses 30M Chinese and 300M English image-text pairs, followed by fine-tuning on long-form visual Q&A data to align with human preferences. The model is trained with the SwissArmyTransformer (sat) library, which supports parameter-efficient fine-tuning methods such as LoRA and P-tuning.
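To illustrate the bridging idea, below is a deliberately simplified, conceptual sketch (not the actual VisualGLM-6B code): a small set of learned query vectors cross-attends over image patch features and is projected into the language model's hidden size, producing a short sequence of "visual tokens" that can be prepended to the text embeddings. The dimensions (1408 for the vision encoder, 4096 for ChatGLM-6B, 32 queries) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Toy Q-Former-style bridge: learned queries cross-attend over patch features."""
    def __init__(self, vision_dim=1408, lm_dim=4096, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim))
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vision_dim, lm_dim)  # project into the LM's embedding space

    def forward(self, patch_features):  # patch_features: (batch, num_patches, vision_dim)
        q = self.queries.unsqueeze(0).expand(patch_features.size(0), -1, -1)
        fused, _ = self.cross_attn(q, patch_features, patch_features)  # queries attend to patches
        return self.proj(fused)  # (batch, num_queries, lm_dim) "visual tokens"

# A 224x224 image yields a fixed-length patch sequence from the vision encoder;
# the bridge compresses it into 32 tokens in the LM's semantic space.
patches = torch.randn(1, 257, 1408)       # illustrative patch features
visual_tokens = QFormerBridge()(patches)
print(visual_tokens.shape)                # torch.Size([1, 32, 4096])
```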
Quick Start & Requirements
Install dependencies:
pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements.txt
(or requirements_wo_ds.txt to skip deepspeed). Key requirements include transformers and SwissArmyTransformer>=0.4.4. A GPU with CUDA is recommended for inference; with INT4 quantization the model requires as little as 6.3GB of VRAM. Both transformers and SwissArmyTransformer (sat) interfaces are provided; see the Hugging Face Hub for model implementations.
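As a minimal sketch of the transformers interface (the THUDM/visualglm-6b checkpoint name and the model.chat(tokenizer, image_path, query, history=...) signature are taken from the upstream model card and should be verified against the current release):

```python
from transformers import AutoModel, AutoTokenizer

# Load the checkpoint with remote code enabled (the model class ships with the weights).
tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()
model.eval()

image_path = "example.jpg"  # path to a local image
# Ask a question about the image; Chinese prompts work as well.
response, history = model.chat(tokenizer, image_path, "Describe this image.", history=[])
print(response)
# Follow-up turns reuse the returned history for multi-round dialogue.
response, history = model.chat(tokenizer, image_path, "Where might it have been taken?", history=history)
print(response)
```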
Highlighted Details
Offers both transformers and SwissArmyTransformer (sat) interfaces; an INT4 loading sketch via the transformers path follows below.
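A hedged sketch of quantized loading (the .quantize(4) call mirrors the ChatGLM-style quantization API and is assumed here rather than confirmed by this summary):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = (
    AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
    .quantize(4)   # INT4 weights; reportedly fits in roughly 6.3GB of VRAM
    .half()
    .cuda()
    .eval()
)
```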
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The v1 model exhibits limitations including factual inaccuracies/hallucinations in image descriptions, attribute misplacement in multi-object scenes, and insufficient detail capture due to a 224x224 input resolution. It currently lacks robust Chinese OCR capabilities.