GLM-V by zai-org

Multimodal reasoning model with a "thinking" paradigm

Created 2 months ago
1,641 stars

Top 25.7% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

GLM-4.1V-Thinking is an open-source Vision-Language Model (VLM) designed for advanced multimodal reasoning. It targets researchers and developers building complex AI applications requiring sophisticated understanding of visual and textual information, offering state-of-the-art performance for its parameter size.

How It Works

This model introduces a "thinking paradigm" enhanced by Reinforcement Learning with Curriculum Sampling (RLCS). This approach aims to improve accuracy, comprehensiveness, and intelligence in complex tasks, moving beyond basic multimodal perception. The model supports a 64k context length and handles arbitrary aspect ratios with up to 4K image resolution.
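
The "thinking" trace is part of the generated text rather than a separate output field, so callers typically split the reasoning from the final reply themselves. Below is a minimal Python sketch, assuming the model wraps its reasoning in <think>...</think> and its final reply in <answer>...</answer> tags (the tag names are an assumption based on the model's described output format, not something this summary confirms):

```python
import re

def split_thinking(output_text: str) -> tuple[str, str]:
    """Split a GLM-4.1V-Thinking style response into (reasoning, answer).

    Assumes reasoning is wrapped in <think>...</think> and the final reply in
    <answer>...</answer>; if the tags are absent, the whole string is treated
    as the answer.
    """
    think = re.search(r"<think>(.*?)</think>", output_text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output_text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else output_text.strip()
    return reasoning, final
```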

Quick Start & Requirements

  • Inference: Can be run using Hugging Face transformers or vLLM; a minimal transformers sketch follows this list.
    • transformers CLI: trans_infer_cli.py
    • vLLM API: vllm serve THUDM/GLM-4.1V-9B-Thinking --limit-mm-per-prompt '{"image":32}' --allowed-local-media-path /
  • Prerequisites: NVIDIA GPU (A100 recommended for optimal performance).
    • transformers inference requires ~22GB VRAM (BF16 precision).
    • vLLM inference requires ~22GB VRAM (BF16 precision).
  • Fine-tuning: Supported by LLaMA-Factory. Requires ~21GB VRAM for LoRA fine-tuning.
  • Demos: Online demos available on Hugging Face and ModelScope.
  • Docs: Inference scripts and API examples are provided within the repository.
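
For a quick smoke test with transformers, a minimal sketch along the lines of the repository's trans_infer_cli.py is shown below. The auto classes, chat-template message format, and example image URL are assumptions; use the repository's own scripts for anything beyond a first run.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "THUDM/GLM-4.1V-9B-Thinking"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # BF16, ~22GB VRAM per the requirements above
    device_map="auto",
)

# Hypothetical image URL for illustration; replace with a local path or your own image.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "Describe this image step by step."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, keeping the thinking tags visible.
print(processor.decode(generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```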

Highlighted Details

  • Achieves state-of-the-art performance among 10B-parameter VLMs, matching or exceeding 72B models on 18 benchmarks.
  • Supports 64k context length and arbitrary aspect ratios up to 4K resolution.
  • Offers a "thinking" mechanism that can interrupt generation to prompt for a final answer.
  • Provides both a reasoning-focused model (GLM-4.1V-9B-Thinking) and a base model (GLM-4.1V-9B-Base) for research.

Maintenance & Community

  • Developed by THUDM.
  • Community channels include WeChat and Discord.
  • Paper available on arXiv: 2507.01006.

Licensing & Compatibility

  • Code is released under Apache License 2.0.
  • Models (GLM-4.1V-9B-Thinking and GLM-4.1V-9B-Base) are licensed under the MIT License.
  • MIT license generally permits commercial use and linking with closed-source software.

Limitations & Caveats

The provided inference scripts primarily target transformers; the "thinking" interruption logic must be implemented separately when serving with vLLM. Video input support also requires modifications to the provided scripts.
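
For the vLLM path, the `vllm serve` command above exposes an OpenAI-compatible endpoint, so a client can send images through the standard chat-completions API and handle the thinking trace itself. A hedged sketch follows; the port, image URL, and client-side parsing of the trace are assumptions, not repository-provided behavior.

```python
from openai import OpenAI

# Assumes the server from the `vllm serve` command above is listening on the
# default port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="THUDM/GLM-4.1V-9B-Thinking",
    messages=[
        {
            "role": "user",
            "content": [
                # Hypothetical image URL for illustration only.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "What does this chart show?"},
            ],
        }
    ],
    max_tokens=2048,
)

# The thinking trace arrives inline in the message content; separating it from
# the final answer (and any interruption logic) is left to the caller.
print(response.choices[0].message.content)
```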

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 44
  • Star History: 178 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

MoE vision-language model for multimodal understanding
Top 0.1% · 5k stars
Created 9 months ago · Updated 6 months ago