Multimodal reasoning model with a "thinking" paradigm
Top 39.5% on sourcepulse
GLM-4.1V-Thinking is an open-source Vision-Language Model (VLM) designed for advanced multimodal reasoning. It targets researchers and developers building complex AI applications requiring sophisticated understanding of visual and textual information, offering state-of-the-art performance for its parameter size.
How It Works
This model introduces a "thinking paradigm" enhanced by Reinforcement Learning with Curriculum Sampling (RLCS). This approach aims to improve accuracy, comprehensiveness, and intelligence on complex tasks, moving beyond basic multimodal perception. The model supports a 64k context length and handles images of arbitrary aspect ratio at up to 4K resolution.
Quick Start & Requirements
Inference is supported via transformers or vLLM.
transformers CLI: trans_infer_cli.py
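For programmatic use outside the CLI, a minimal transformers sketch is shown below. It assumes a recent transformers release in which the model loads through AutoProcessor and AutoModelForImageTextToText, and the image URL is a placeholder; trans_infer_cli.py in the repository remains the authoritative reference.

```python
# Minimal sketch: single-image question answering with transformers.
# Assumptions: a transformers release that supports GLM-4.1V via AutoProcessor /
# AutoModelForImageTextToText; the image URL below is a placeholder.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "THUDM/GLM-4.1V-9B-Thinking"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # ~22GB VRAM at BF16
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image
            {"type": "text", "text": "Describe the trend shown in this chart."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# The model emits its reasoning ("thinking") before the final answer,
# so leave generous headroom in max_new_tokens.
output_ids = model.generate(**inputs, max_new_tokens=1024)
generated = output_ids[0][inputs["input_ids"].shape[1]:]
print(processor.decode(generated, skip_special_tokens=True))
```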
vLLM API: vllm serve THUDM/GLM-4.1V-9B-Thinking --limit-mm-per-prompt '{"image":32}' --allowed-local-media-path /
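Once the server is running, requests go through vLLM's OpenAI-compatible chat completions endpoint. The sketch below assumes the default address of http://localhost:8000/v1 and uses a placeholder image URL.

```python
# Minimal client sketch for the OpenAI-compatible endpoint exposed by `vllm serve`.
# Assumptions: default host/port (http://localhost:8000/v1); placeholder image URL.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

response = client.chat.completions.create(
    model="THUDM/GLM-4.1V-9B-Thinking",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
                {"type": "text", "text": "What does this diagram show?"},
            ],
        }
    ],
    max_tokens=1024,  # leave room for the reasoning trace plus the final answer
)
print(response.choices[0].message.content)
```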
Both transformers and vLLM inference require ~22GB VRAM (BF16 precision).
Highlighted Details
Maintenance & Community
Last update: 1 week ago; activity status: Inactive.
Licensing & Compatibility
Limitations & Caveats
The provided inference scripts primarily target transformers; supporting the "thinking" interruption logic with vLLM requires a custom implementation. Video input also requires modifications to the provided scripts.
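Where a backend does not handle the reasoning trace for you, a small post-processing step can separate it from the final answer. The sketch below is assumption-laden: it presumes the trace is delimited by <think>...</think> tags and the answer optionally by <answer>...</answer>, following common conventions for thinking-style models; verify the actual markers against the repository's chat template before relying on it.

```python
import re

# Assumption: the reasoning trace is wrapped in <think>...</think> and the final
# answer (optionally) in <answer>...</answer>; verify against the repo's chat template.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    """Return (thinking, answer) extracted from raw model output."""
    thinking = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer_match = ANSWER_RE.search(text)
    if answer_match:
        answer = answer_match.group(1).strip()
    else:
        # Fall back to everything outside the think tags.
        answer = THINK_RE.sub("", text).strip()
    return thinking, answer

raw = "<think>The chart shows a steady rise.</think><answer>Revenue grew each quarter.</answer>"
thinking, answer = split_thinking(raw)
print(answer)  # -> Revenue grew each quarter.
```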