GLM-4.1V-Thinking by zai-org

Multimodal reasoning model with a "thinking" paradigm

created 1 month ago
950 stars

Top 39.5% on sourcepulse

View on GitHub
Project Summary

GLM-4.1V-Thinking is an open-source Vision-Language Model (VLM) designed for advanced multimodal reasoning. It targets researchers and developers building complex AI applications requiring sophisticated understanding of visual and textual information, offering state-of-the-art performance for its parameter size.

How It Works

This model introduces a "thinking paradigm" enhanced by Reinforcement Learning with Curriculum Sampling (RLCS). This approach aims to improve accuracy, comprehensiveness, and intelligence in complex tasks, moving beyond basic multimodal perception. The model supports a 64k context length and handles arbitrary aspect ratios with up to 4K image resolution.
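
As a rough illustration of what the thinking paradigm looks like in practice, the sketch below runs a single-image prompt through the model with Hugging Face transformers. It is a minimal sketch, assuming a recent transformers release that ships GLM-4.1V support via the generic image-text-to-text API; the image URL is a placeholder and the model ID is taken from the repo's vLLM example.

```python
# Minimal single-image inference sketch (assumptions: recent transformers
# with GLM-4.1V support; placeholder image URL).
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "THUDM/GLM-4.1V-9B-Thinking"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/demo.jpg"},  # placeholder
        {"type": "text", "text": "Describe this image step by step."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens (reasoning trace plus answer).
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```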

Quick Start & Requirements

  • Inference: Can be run using Hugging Face transformers or vLLM (a client sketch for the vLLM server follows this list).
    • transformers CLI: trans_infer_cli.py
    • vLLM API: vllm serve THUDM/GLM-4.1V-9B-Thinking --limit-mm-per-prompt '{"image":32}' --allowed-local-media-path /
  • Prerequisites: NVIDIA GPU (A100 recommended for optimal performance).
    • transformers inference requires ~22GB VRAM (BF16 precision).
    • vLLM inference requires ~22GB VRAM (BF16 precision).
  • Fine-tuning: Supported by LLaMA-Factory. Requires ~21GB VRAM for LoRA fine-tuning.
  • Demos: Online demos available on Hugging Face and ModelScope.
  • Docs: Inference scripts and API examples are provided within the repository.
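
Once the `vllm serve` command above is running, it exposes an OpenAI-compatible API. The sketch below queries it with the official openai Python client; the port (vLLM's default 8000), the `file://` image path, and the prompt are assumptions to adapt to your deployment.

```python
# Sketch of querying the OpenAI-compatible endpoint started by `vllm serve`.
# Port, image path, and prompt are assumptions; adjust to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="THUDM/GLM-4.1V-9B-Thinking",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///data/demo.jpg"}},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }],
    max_tokens=1024,
)

# The response text includes the reasoning trace; see the parsing sketch below.
print(response.choices[0].message.content)
```

Per the serve flags shown above, a single request may include up to 32 images (`--limit-mm-per-prompt '{"image":32}'`), and local files are readable because of `--allowed-local-media-path`.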

Highlighted Details

  • Achieves state-of-the-art performance among 10B-parameter VLMs, matching or exceeding 72B models on 18 benchmarks.
  • Supports 64k context length and arbitrary aspect ratios up to 4K resolution.
  • Offers a "thinking" mechanism that can interrupt generation and prompt the model for a final answer (see the response-parsing sketch after this list).
  • Provides both a reasoning-focused model (GLM-4.1V-9B-Thinking) and a base model (GLM-4.1V-9B-Base) for research.
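
For reference, responses from the Thinking model typically wrap the reasoning trace and the final answer in delimiter tags. The `<think>`/`<answer>` tags used below are an assumption based on the model's output style; verify them against the repo's chat template before relying on them.

```python
# Sketch for splitting a raw response into reasoning trace and final answer.
# The <think>/<answer> delimiters are an assumption, not confirmed by the repo.
import re


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from a raw model response."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else text.strip()
    return reasoning, final


raw = "<think>The sign is red and octagonal, so it is a stop sign.</think><answer>A stop sign.</answer>"
reasoning, final = split_thinking(raw)
print(final)  # -> "A stop sign."
```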

Maintenance & Community

  • Developed by THUDM (the repository is hosted under the zai-org GitHub organization).
  • Community channels include WeChat and Discord.
  • Paper available on arXiv: 2507.01006.

Licensing & Compatibility

  • Code is released under Apache License 2.0.
  • Models (GLM-4.1V-9B-Thinking and GLM-4.1V-9B-Base) are licensed under the MIT License.
  • MIT license generally permits commercial use and linking with closed-source software.

Limitations & Caveats

The provided inference scripts primarily target transformers; a custom implementation is needed for vLLM to support the "thinking" interruption logic, and video input requires modifying the provided scripts.
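
Until that logic is ported, one client-side approximation is to cap the per-request token budget and, if the reply is cut off mid-thought, explicitly ask for a final answer in a follow-up turn. The sketch below shows this idea against the vLLM endpoint; it is not the repository's implementation, and the endpoint URL and budgets are assumptions.

```python
# Rough client-side approximation of "interrupt thinking and answer" against
# the vLLM OpenAI-compatible endpoint. Not the repo's implementation; the
# endpoint URL and token budgets are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "THUDM/GLM-4.1V-9B-Thinking"

messages = [{"role": "user", "content": "Is 2^31 - 1 a prime number? Think it through."}]
reply = client.chat.completions.create(model=MODEL, messages=messages, max_tokens=512)

if reply.choices[0].finish_reason == "length":
    # Thinking budget exhausted: feed back the partial reasoning and ask the
    # model to commit to an answer now.
    messages += [
        {"role": "assistant", "content": reply.choices[0].message.content},
        {"role": "user", "content": "Stop thinking and give your final answer now."},
    ]
    reply = client.chat.completions.create(model=MODEL, messages=messages, max_tokens=128)

print(reply.choices[0].message.content)
```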

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 127
  • Star History: 962 stars in the last 90 days
