VisualGLM-6B by zai-org

Multimodal dialog language model for images, Chinese, and English

Created 2 years ago · 4,164 stars · Top 11.8% on SourcePulse

Project Summary

VisualGLM-6B is an open-source, multimodal conversational language model that supports images, Chinese, and English. It targets developers and researchers working with multimodal AI, combining ChatGLM-6B with a BLIP2-Qformer visual bridge for a total of 7.8B parameters. The model enables image-grounded Q&A and dialogue, with a focus on efficient deployment on consumer hardware.

How It Works

VisualGLM-6B integrates a 6.2B parameter language model (ChatGLM-6B) with a vision component trained using BLIP2-Qformer. This approach bridges visual and linguistic modalities by aligning visual information into the language model's semantic space. Pre-training uses 30M Chinese and 300M English image-text pairs, followed by fine-tuning on long-form visual Q&A data to align with human preferences. The model is trained using the SwissArmyTransformer (sat) library, which supports parameter-efficient fine-tuning methods like LoRA and P-tuning.
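To make the bridging idea concrete, here is a minimal conceptual sketch (not the project's actual implementation) of how a BLIP2-Qformer-style module can map frozen image features into a language model's embedding space. The class name, dimensions (e.g. 1408 for the vision encoder, 4096 for the LM hidden size), and layer layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Conceptual sketch: learned queries cross-attend to frozen image features,
    then a linear projection maps the result into the language model's embedding space."""

    def __init__(self, num_queries=32, vision_dim=1408, hidden_dim=768, lm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.to_lm = nn.Linear(hidden_dim, lm_dim)

    def forward(self, image_features):
        # image_features: (batch, num_patches, vision_dim) from a frozen image encoder
        kv = self.vision_proj(image_features)
        q = self.queries.unsqueeze(0).expand(image_features.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)
        # (batch, num_queries, lm_dim) -- these "visual tokens" are prepended to the
        # token embeddings consumed by the language model
        return self.to_lm(out)

bridge = QFormerBridge()
dummy_patches = torch.randn(1, 257, 1408)  # hypothetical ViT patch features
visual_tokens = bridge(dummy_patches)
print(visual_tokens.shape)                 # torch.Size([1, 32, 4096])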

Quick Start & Requirements

  • Install: pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements.txt (or requirements_wo_ds.txt to skip deepspeed).
  • Dependencies: Python, transformers, SwissArmyTransformer>=0.4.4. GPU with CUDA is recommended for inference. INT4 quantization requires as little as 6.3GB VRAM.
  • Usage: both Hugging Face transformers and SwissArmyTransformer (sat) interfaces are provided; see the Hugging Face Hub for model implementations and the sketch below for the transformers path.
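
A sketch of the transformers path, following the project's documented `model.chat` interface; the checkpoint ID `THUDM/visualglm-6b` and the call signature are taken from the project's README, but verify against the current Hub card before relying on them (the image path below is a hypothetical file name).

```python
from transformers import AutoTokenizer, AutoModel

# trust_remote_code is required because VisualGLM ships custom modeling code with the weights
tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()

image_path = "example.jpg"  # path to a local image
response, history = model.chat(tokenizer, image_path, "Describe this image.", history=[])
print(response)

# Follow-up turn reusing the dialogue history
response, history = model.chat(tokenizer, image_path,
                               "Where might this photo have been taken?", history=history)
print(response)
```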

Highlighted Details

  • Supports both Huggingface transformers and SwissArmyTransformer (sat) interfaces.
  • Offers parameter-efficient fine-tuning (LoRA, QLoRA, P-tuning) and model merging capabilities.
  • Provides command-line, Gradio web UI, and API deployment options.
  • Supports 4-bit and 8-bit quantization for reduced VRAM usage during inference (see the sketch after this list).
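
For the quantization bullet above, the model's custom Hugging Face code exposes a `quantize()` helper, as shown in the project's README; a sketch assuming the same `THUDM/visualglm-6b` checkpoint (INT8 shown, INT4 via `quantize(4)`):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)

# 8-bit quantization; use .quantize(4) for INT4 (~6.3 GB VRAM per the project docs)
model = (
    AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
    .quantize(8)
    .half()
    .cuda()
)
model.eval()
```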

Maintenance & Community

  • The project is from THUDM (now published under the zai-org organization).
  • Community links (Slack and WeChat) are provided.
  • The README mentions the follow-up CogVLM models.

Licensing & Compatibility

  • Code is licensed under Apache-2.0.
  • Model weights are subject to a separate "Model License".
  • Users are cautioned against using the model for harmful purposes or services without safety assessment.

Limitations & Caveats

The v1 model exhibits limitations including factual inaccuracies/hallucinations in image descriptions, attribute misplacement in multi-object scenes, and insufficient detail capture due to a 224x224 input resolution. It currently lacks robust Chinese OCR capabilities.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days

Explore Similar Projects

  • X-LLM by phellonchen — 314 stars — multimodal LLM research paper. Created 2 years ago, updated 2 years ago. Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").
  • LLaVA by haotian-liu — 24k stars — multimodal assistant with GPT-4 level capabilities. Created 2 years ago, updated 1 year ago. Starred by Shizhe Diao (author of LMFlow; Research Scientist at NVIDIA), Zack Li (cofounder of Nexa AI), and 19 more.