C++ project for ChatGLM & GLM model inference
This project provides a pure C++ implementation of the ChatGLM family of large language models, enabling efficient, real-time inference on consumer hardware, including MacBooks. It targets developers and researchers looking to deploy these models locally with optimized CPU and GPU (NVIDIA, Apple Silicon) performance.
How It Works
Leveraging the ggml library, similar to llama.cpp, this project offers accelerated CPU inference through int4/int8 quantization, an optimized KV cache, and parallel computing. It supports the ChatGLM model family (ChatGLM-6B, ChatGLM2-6B, ChatGLM3-6B, GLM-4-9B, GLM-4V-9B) and CodeGeeX2, with features such as streaming generation and support for P-Tuning v2 and LoRA fine-tuned models.
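As a rough illustration of how the quantization types come into play, the conversion script is the place where the quantized GGML file is produced. A minimal sketch follows; the -i/-t/-o flags and the model identifier are assumptions based on typical usage, so check chatglm_cpp/convert.py in the repository for the exact interface.

```bash
# Hypothetical conversion invocations -- the flag names (-i, -t, -o) and the
# Hugging Face model ID are assumptions; consult the project README or
# `python3 chatglm_cpp/convert.py --help` for the real interface.

# int4 quantization (smallest file, fastest CPU inference)
python3 chatglm_cpp/convert.py -i THUDM/chatglm3-6b -t q4_0 -o models/chatglm3-ggml.bin

# int8 quantization (larger file, usually closer to full-precision quality)
python3 chatglm_cpp/convert.py -i THUDM/chatglm3-6b -t q8_0 -o models/chatglm3-ggml.bin
```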
Quick Start & Requirements
To get started: clone the repository (git clone --recursive ...), update submodules (git submodule update --init --recursive), install the Python dependencies (pip install torch tabulate tqdm transformers accelerate sentencepiece), convert a model to GGML format (python3 chatglm_cpp/convert.py ...), build the project (cmake -B build && cmake --build build -j --config Release), and run it (./build/bin/main -m models/chatglm-ggml.bin -p "你好"). GPU acceleration can be enabled with CMake flags (-DGGML_CUDA=ON for NVIDIA CUDA, -DGGML_METAL=ON for Apple Silicon Metal).
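Putting those steps together, an end-to-end session might look like the sketch below. The repository URL and the Hugging Face model ID are placeholders not given above, and the convert.py flags are assumptions; the remaining commands are taken directly from the steps just listed.

```bash
# End-to-end sketch of the quick start above. <repo-url> and <hf-model-id>
# are placeholders -- substitute the real values from the project README.
git clone --recursive <repo-url> chatglm.cpp && cd chatglm.cpp
git submodule update --init --recursive

# Python dependencies for the conversion script
pip install torch tabulate tqdm transformers accelerate sentencepiece

# Convert a Hugging Face checkpoint to a quantized GGML file (flags assumed)
python3 chatglm_cpp/convert.py -i <hf-model-id> -t q4_0 -o models/chatglm-ggml.bin

# Build the CLI and run a prompt
cmake -B build && cmake --build build -j --config Release
./build/bin/main -m models/chatglm-ggml.bin -p "你好"
```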
Highlighted Details
Maintenance & Community
The project is actively maintained and inspired by llama.cpp. Community interaction channels are not explicitly listed in the README.
Licensing & Compatibility
The project is licensed under the Apache License 2.0, permitting commercial use and linking with closed-source applications.
Limitations & Caveats
The project supports a range of hardware and platforms, but the README notes that vision encoding for GLM4V is slow on CPU and recommends running it on a GPU. Metal (MPS) performance on Apple Silicon is listed as "N/A" for some quantization types (q5_0, q5_1).
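If the GPU path is needed (for example, for GLM4V vision encoding), the acceleration backends are selected at configure time. A sketch, assuming the flags mentioned above behave as ordinary CMake options:

```bash
# Configure-time switches for GPU backends (flags taken from the text above).
# NVIDIA CUDA build:
cmake -B build -DGGML_CUDA=ON && cmake --build build -j --config Release

# Apple Silicon Metal build:
cmake -B build -DGGML_METAL=ON && cmake --build build -j --config Release
```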