quant.cpp  by quantumaikr

Pure C LLM inference for massive context

Created 2 weeks ago

372 stars

Top 76.1% on SourcePulse

View on GitHub
Project Summary

quantumaikr/quant.cpp

This project addresses the significant memory overhead of Key-Value (KV) caches in Large Language Models (LLMs), which often limits context window size more than model weights. quant.cpp provides a pure C, zero-dependency inference engine focused on aggressive KV cache compression, enabling dramatically longer context lengths on existing hardware with minimal to no quality degradation. It is designed for developers seeking to embed LLM inference into applications or run models with extensive context, offering a highly embeddable, single-header library.

How It Works

The core technique is aggressive, near-lossless KV cache quantization. Instead of storing KV pairs in FP16, quant.cpp quantizes keys to 4-bit or 3-bit and values to Q4 precision, achieving a 3.8x to 6.9x memory reduction. It further employs delta encoding of adjacent keys, akin to video compression, to reach up to 8.5x compression with a minimal perplexity increase (+1.3%). This approach prioritizes memory efficiency over raw inference speed, allowing models to retain context for hundreds of thousands of tokens.

Quick Start & Requirements

  1. Build:
    git clone https://github.com/quantumaikr/quant.cpp && cd quant.cpp
    cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc)
    
  2. Download Model:
    pip install huggingface_hub
    hf download bartowski/SmolLM2-135M-Instruct-GGUF SmolLM2-135M-Instruct-Q8_0.gguf --local-dir models/
    
  3. Run:
    ./build/quant models/SmolLM2-135M-Instruct-Q8_0.gguf --chat -p "Hello!" -j 4
    
    For KV compression:
    ./build/quant models/SmolLM2-135M-Instruct-Q8_0.gguf --chat -p "Hello!" -k uniform_4b -v q4
    
  • Prerequisites: A C compiler, CMake, and Python (for model downloads).
  • Hardware: CPU-focused, with experimental support for Metal, CUDA, Vulkan. NEON and AVX2 optimizations are production-ready.
  • Links: API Docs, WASM Demo, Roadmap

Highlighted Details

  • KV Compression: Achieves 3.8x-6.9x KV cache reduction with 0.0% perplexity loss (4-bit K + Q4 V) or up to 8.5x with 1.3% loss (delta 3-bit K + Q4 V).
  • Embeddability: Available as a single-header library (quant.h, 15.7K LOC, 643KB) with zero build dependencies beyond a C compiler.
  • WASM Support: Compiles to a 192KB binary for client-side browser inference.
  • Architecture Support: GGUF format, including Llama, Qwen, Gemma, and Gemma 4's hybrid MoE architecture.
  • Zero Dependencies: Core library requires only a standard C compiler.

Maintenance & Community

The repository is maintained by quantumaikr. The README does not mention community channels (e.g., Discord, Slack), a roadmap beyond the v1.3 plan, or notable sponsorships/partnerships.

Licensing & Compatibility

The project's license is not explicitly stated in the provided README. This absence is a critical factor for evaluating adoption, especially for commercial or closed-source applications.

Limitations & Caveats

While the GPU backends (CUDA, Metal) build, they remain experimental; the project's optimizations and performance claims are CPU-centric, so raw throughput trails highly optimized GPU engines such as vLLM. Speed improvements are noted as actively in progress. The lack of a stated license remains a significant adoption blocker.

Health Check

  • Last Commit: 22 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 29
  • Issues (30d): 15
  • Star History: 374 stars in the last 14 days
