GLM-V by zai-org

Multimodal reasoning model with a "thinking" paradigm

Created 2 months ago
1,641 stars

Top 25.7% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

GLM-4.1V-Thinking is an open-source Vision-Language Model (VLM) designed for advanced multimodal reasoning. It targets researchers and developers building complex AI applications requiring sophisticated understanding of visual and textual information, offering state-of-the-art performance for its parameter size.

How It Works

This model introduces a "thinking paradigm" enhanced by Reinforcement Learning with Curriculum Sampling (RLCS). This approach aims to improve accuracy, comprehensiveness, and intelligence in complex tasks, moving beyond basic multimodal perception. The model supports a 64k context length and handles arbitrary aspect ratios with up to 4K image resolution.
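
The "thinking" trace is part of the generated text rather than a separate output field, so callers typically split the reasoning from the final reply themselves. Below is a minimal Python sketch, assuming the model wraps its reasoning in <think>...</think> and its final reply in <answer>...</answer> tags (the tag names are an assumption based on the model's described output format, not something this summary confirms):

```python
import re

def split_thinking(output_text: str) -> tuple[str, str]:
    """Split a GLM-4.1V-Thinking style response into (reasoning, answer).

    Assumes reasoning is wrapped in <think>...</think> and the final reply in
    <answer>...</answer>; if the tags are absent, the whole string is treated
    as the answer.
    """
    think = re.search(r"<think>(.*?)</think>", output_text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output_text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else output_text.strip()
    return reasoning, final
```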

Quick Start & Requirements

  • Inference: Can be run using Hugging Face transformers or vLLM; a minimal transformers sketch follows this list.
    • transformers CLI: trans_infer_cli.py
    • vLLM API: vllm serve THUDM/GLM-4.1V-9B-Thinking --limit-mm-per-prompt '{"image":32}' --allowed-local-media-path /
  • Prerequisites: NVIDIA GPU (A100 recommended for optimal performance).
    • transformers inference requires ~22GB VRAM (BF16 precision).
    • vLLM inference requires ~22GB VRAM (BF16 precision).
  • Fine-tuning: Supported by LLaMA-Factory. Requires ~21GB VRAM for LoRA fine-tuning.
  • Demos: Online demos available on Hugging Face and ModelScope.
  • Docs: Inference scripts and API examples are provided within the repository.
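
For a quick smoke test with transformers, a minimal sketch along the lines of the repository's trans_infer_cli.py is shown below. The auto classes, chat-template message format, and example image URL are assumptions; use the repository's own scripts for anything beyond a first run.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "THUDM/GLM-4.1V-9B-Thinking"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # BF16, ~22GB VRAM per the requirements above
    device_map="auto",
)

# Hypothetical image URL for illustration; replace with a local path or your own image.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "Describe this image step by step."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, keeping the thinking tags visible.
print(processor.decode(generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```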

Highlighted Details

  • Achieves state-of-the-art performance among 10B-parameter VLMs, matching or exceeding 72B models on 18 benchmarks.
  • Supports 64k context length and arbitrary aspect ratios up to 4K resolution.
  • Offers a "thinking" mechanism that can interrupt generation to prompt for a final answer.
  • Provides both a reasoning-focused model (GLM-4.1V-9B-Thinking) and a base model (GLM-4.1V-9B-Base) for research.

Maintenance & Community

  • Developed by THUDM.
  • Community channels include WeChat and Discord.
  • Paper available on arXiv: 2507.01006.

Licensing & Compatibility

  • Code is released under Apache License 2.0.
  • Models (GLM-4.1V-9B-Thinking and GLM-4.1V-9B-Base) are licensed under the MIT License.
  • MIT license generally permits commercial use and linking with closed-source software.

Limitations & Caveats

The provided inference scripts primarily target transformers; the "thinking" interruption logic must be implemented separately when serving with vLLM. Video input support also requires modifications to the provided scripts.
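
For the vLLM path, the `vllm serve` command above exposes an OpenAI-compatible endpoint, so a client can send images through the standard chat-completions API and handle the thinking trace itself. A hedged sketch follows; the port, image URL, and client-side parsing of the trace are assumptions, not repository-provided behavior.

```python
from openai import OpenAI

# Assumes the server from the `vllm serve` command above is listening on the
# default port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="THUDM/GLM-4.1V-9B-Thinking",
    messages=[
        {
            "role": "user",
            "content": [
                # Hypothetical image URL for illustration only.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "What does this chart show?"},
            ],
        }
    ],
    max_tokens=2048,
)

# The thinking trace arrives inline in the message content; separating it from
# the final answer (and any interruption logic) is left to the caller.
print(response.choices[0].message.content)
```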

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 44
  • Star History: 178 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

MoE vision-language model for multimodal understanding
Top 0.1% · 5k stars
Created 9 months ago · Updated 6 months ago