C++ project for ChatGLM & GLM model inference
This project provides a pure C++ implementation of the ChatGLM family of large language models, enabling efficient, real-time inference on consumer hardware, including MacBooks. It targets developers and researchers looking to deploy these models locally with optimized CPU and GPU (NVIDIA, Apple Silicon) performance.
How It Works
Leveraging the ggml library, similar to llama.cpp, this project offers accelerated CPU inference through int4/int8 quantization, an optimized KV cache, and parallel computing. It supports the ChatGLM model family (ChatGLM-6B, ChatGLM2-6B, ChatGLM3-6B, GLM-4-9B, GLM-4V-9B) and CodeGeeX2, with features such as streaming generation and support for P-Tuning v2 and LoRA fine-tuned models.
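As a rough illustration of how the quantization types come into play, the conversion script is the place where the quantized GGML file is produced. A minimal sketch follows; the -i/-t/-o flags and the model identifier are assumptions based on typical usage, so check chatglm_cpp/convert.py in the repository for the exact interface.

```bash
# Hypothetical conversion invocations -- the flag names (-i, -t, -o) and the
# Hugging Face model ID are assumptions; consult the project README or
# `python3 chatglm_cpp/convert.py --help` for the real interface.

# int4 quantization (smallest file, fastest CPU inference)
python3 chatglm_cpp/convert.py -i THUDM/chatglm3-6b -t q4_0 -o models/chatglm3-ggml.bin

# int8 quantization (larger file, usually closer to full-precision quality)
python3 chatglm_cpp/convert.py -i THUDM/chatglm3-6b -t q8_0 -o models/chatglm3-ggml.bin
```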
Quick Start & Requirements
To get started: clone the repository (git clone --recursive ...), update submodules (git submodule update --init --recursive), install the Python dependencies (pip install torch tabulate tqdm transformers accelerate sentencepiece), convert a model to GGML format (python3 chatglm_cpp/convert.py ...), build the project (cmake -B build && cmake --build build -j --config Release), and run it (./build/bin/main -m models/chatglm-ggml.bin -p "你好"). GPU acceleration can be enabled with CMake flags (-DGGML_CUDA=ON for NVIDIA CUDA, -DGGML_METAL=ON for Apple Silicon Metal).
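Putting those steps together, an end-to-end session might look like the sketch below. The repository URL and the Hugging Face model ID are placeholders not given above, and the convert.py flags are assumptions; the remaining commands are taken directly from the steps just listed.

```bash
# End-to-end sketch of the quick start above. <repo-url> and <hf-model-id>
# are placeholders -- substitute the real values from the project README.
git clone --recursive <repo-url> chatglm.cpp && cd chatglm.cpp
git submodule update --init --recursive

# Python dependencies for the conversion script
pip install torch tabulate tqdm transformers accelerate sentencepiece

# Convert a Hugging Face checkpoint to a quantized GGML file (flags assumed)
python3 chatglm_cpp/convert.py -i <hf-model-id> -t q4_0 -o models/chatglm-ggml.bin

# Build the CLI and run a prompt
cmake -B build && cmake --build build -j --config Release
./build/bin/main -m models/chatglm-ggml.bin -p "你好"
```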
Highlighted Details
Maintenance & Community
The project is actively maintained and inspired by llama.cpp. Community interaction channels are not explicitly listed in the README.
Licensing & Compatibility
The project is licensed under the Apache License 2.0, permitting commercial use and linking with closed-source applications.
Limitations & Caveats
The project supports a range of hardware and platforms, but the README notes that vision encoding for GLM4V is slow on CPU and recommends running it on a GPU. Metal (MPS) performance on Apple Silicon is listed as "N/A" for some quantization types (q5_0, q5_1).
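If the GPU path is needed (for example, for GLM4V vision encoding), the acceleration backends are selected at configure time. A sketch, assuming the flags mentioned above behave as ordinary CMake options:

```bash
# Configure-time switches for GPU backends (flags taken from the text above).
# NVIDIA CUDA build:
cmake -B build -DGGML_CUDA=ON && cmake --build build -j --config Release

# Apple Silicon Metal build:
cmake -B build -DGGML_METAL=ON && cmake --build build -j --config Release
```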