chatglm.cpp by li-plus

C++ project for ChatGLM & GLM model inference

created 2 years ago
2,980 stars

Top 16.4% on sourcepulse

Project Summary

This project provides a pure C++ implementation of the ChatGLM family of large language models, enabling efficient, real-time inference on consumer hardware, including MacBooks. It targets developers and researchers looking to deploy these models locally with optimized CPU and GPU (NVIDIA, Apple Silicon) performance.

How It Works

Built on the ggml tensor library (the same foundation as llama.cpp), the project accelerates CPU inference through int4/int8 quantization, an optimized KV cache, and multi-threaded parallel computation. It supports the ChatGLM family (ChatGLM-6B, ChatGLM2-6B, ChatGLM3-6B, GLM-4-9B, GLM-4V-9B) and CodeGeeX2, with streaming generation and support for P-Tuning v2 and LoRA fine-tuned models.
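As an illustration of the streaming API, here is a minimal sketch using the Python bindings. The package name (chatglm-cpp), the Pipeline/ChatMessage interface, and the stream flag follow the upstream examples and should be treated as assumptions rather than a verified API reference.

```python
import chatglm_cpp  # pip install chatglm-cpp (assumed package name)

# Load a converted GGML model; the path is only an example.
pipeline = chatglm_cpp.Pipeline("models/chatglm-ggml.bin")

messages = [chatglm_cpp.ChatMessage(role="user", content="你好")]

# Stream tokens as they are generated instead of waiting for the full reply.
for chunk in pipeline.chat(messages, stream=True):
    print(chunk.content, end="", flush=True)
print()
```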

Quick Start & Requirements

  • Install: clone the repository (git clone --recursive ...), update submodules (git submodule update --init --recursive), install the Python conversion dependencies (pip install torch tabulate tqdm transformers accelerate sentencepiece), convert a model (python3 chatglm_cpp/convert.py ...), build (cmake -B build && cmake --build build -j --config Release), and run it (./build/bin/main -m models/chatglm-ggml.bin -p "你好"). A conversion sketch follows this list.
  • Prerequisites: Python 3, CMake, C++ compiler. Optional acceleration via CUDA (NVIDIA) or Metal (Apple Silicon) requires specific CMake flags (-DGGML_CUDA=ON, -DGGML_METAL=ON).
  • Resources: Model conversion requires PyTorch. Inference performance varies by hardware; benchmarks are provided for CPU, V100, and M2 Ultra.
  • Docs: https://github.com/li-plus/chatglm.cpp
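
For reference, the conversion step and a quick smoke test could look like the sketch below. The convert.py flags (-i input model, -t quantization type, -o output path) and the Pipeline.generate call are taken from the upstream README and examples; the chatglm-cpp package name and the THUDM/chatglm3-6b model id are illustrative assumptions.

```python
import subprocess

import chatglm_cpp  # pip install chatglm-cpp (assumed package name)

# Convert a Hugging Face checkpoint to an int4-quantized GGML file.
subprocess.run(
    [
        "python3", "chatglm_cpp/convert.py",
        "-i", "THUDM/chatglm3-6b",       # example model id; any supported ChatGLM variant
        "-t", "q4_0",                    # quantization type (int4)
        "-o", "models/chatglm3-ggml.bin",
    ],
    check=True,
)

# Load the converted model and run a short generation as a smoke test.
pipeline = chatglm_cpp.Pipeline("models/chatglm3-ggml.bin")
print(pipeline.generate("你好", max_length=64))
```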

Highlighted Details

  • Supports multiple ChatGLM and CodeGeeX2 model versions.
  • Offers Python bindings, API servers (LangChain and OpenAI compatible), and a web demo; a client-side sketch for the OpenAI-compatible server follows this list.
  • Enables runtime conversion of Hugging Face models to GGML format.
  • Provides detailed performance benchmarks across CPU, CUDA, and Metal backends.
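
To show what the OpenAI-compatible server mentioned above looks like from the client side, here is a hedged sketch using the standard openai Python client. It assumes the bundled server is already running locally on port 8000; the host, port, and placeholder model name are assumptions, so adjust them to however the server was launched.

```python
from openai import OpenAI  # pip install openai; used here only as a generic client

# Point the standard OpenAI client at the locally running chatglm.cpp server.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",  # placeholder; the local server serves whichever model it was started with
    messages=[{"role": "user", "content": "你好"}],
)
print(response.choices[0].message.content)
```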

Maintenance & Community

The README describes the project as actively maintained and inspired by llama.cpp, though the Health Check below shows the last commit was about a year ago. Community interaction channels are not explicitly listed in the README.

Licensing & Compatibility

The project is licensed under the Apache License 2.0, permitting commercial use and linking with closed-source applications.

Limitations & Caveats

Although the project supports a range of hardware and platforms, the README notes that vision encoding for GLM4V is slow on CPU and recommends GPU usage. Metal (MPS) performance on Apple Silicon is listed as "N/A" for some quantization types (q5_0 and q5_1).

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
Star History
24 stars in the last 90 days

Explore Similar Projects

Starred by Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), Eugene Yan (AI Scientist at AWS), and 2 more.

starcoder.cpp by bigcode-project

0.2% · 456 stars
C++ example for StarCoder inference
created 2 years ago, updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16

0.4% · 4k stars
High-performance C++ LLM inference library
created 2 years ago, updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

0.4% · 84k stars
C/C++ library for local LLM inference
created 2 years ago, updated 14 hours ago