Pure C LLM inference for massive context
quantumaikr/quant.cpp
This project addresses the significant memory overhead of Key-Value (KV) caches in Large Language Models (LLMs), which often limits context window size more than model weights. quant.cpp provides a pure C, zero-dependency inference engine focused on aggressive KV cache compression, enabling dramatically longer context lengths on existing hardware with minimal to no quality degradation. It is designed for developers seeking to embed LLM inference into applications or run models with extensive context, offering a highly embeddable, single-header library.
How It Works
The core innovation is aggressive, near-lossless KV cache compression. Instead of storing KV pairs in FP16, quant.cpp quantizes keys to 4-bit or 3-bit precision and values to Q4, achieving a 3.8x to 6.9x memory reduction. It further applies delta encoding to adjacent keys, akin to video compression, reaching up to 8.5x compression with only a minimal perplexity increase (+1.3%). This approach prioritizes memory efficiency over raw inference speed, allowing models to retain context across hundreds of thousands of tokens.
Quick Start & Requirements
git clone https://github.com/quantumaikr/quant.cpp && cd quant.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc)
pip install huggingface_hub
hf download bartowski/SmolLM2-135M-Instruct-GGUF SmolLM2-135M-Instruct-Q8_0.gguf --local-dir models/
./build/quant models/SmolLM2-135M-Instruct-Q8_0.gguf --chat -p "Hello!" -j 4
For KV compression:
./build/quant models/SmolLM2-135M-Instruct-Q8_0.gguf --chat -p "Hello!" -k uniform_4b -v q4
Highlighted Details
Single-header, highly embeddable library (quant.h, 15.7K LOC, 643KB) with zero build dependencies beyond a C compiler.
Maintenance & Community
The repository is maintained by quantumaikr. Specific community channels (e.g., Discord, Slack), detailed roadmaps beyond a v1.3 plan, or notable sponsorships/partnerships are not explicitly detailed in the README.
Licensing & Compatibility
The project's license is not explicitly stated in the provided README. This absence is a critical factor for evaluating adoption, especially for commercial or closed-source applications.
Limitations & Caveats
While GPU backends (CUDA, Metal) are supported and compile, the project's primary optimizations and performance claims are CPU-centric, so it is slower in raw throughput than highly optimized GPU engines such as vLLM. Speed improvements are noted as actively in progress. The absence of a stated license remains a significant adoption blocker.