CLI tool for differentiated KV cache quantization on Apple Silicon
KVSplit enables running larger LLMs with longer context windows on Apple Silicon by applying differentiated precision to the KV cache. It targets users with M1/M2/M3 Macs who want to maximize LLM performance and context length within memory constraints, offering significant memory savings with minimal quality loss.
How It Works
KVSplit builds on the observation that keys in the KV cache are more sensitive to quantization than values. Storing keys at 8-bit precision and values at 4-bit precision (K8V4) reduces KV cache memory by 59% with less than 1% perplexity degradation compared to FP16. The project is optimized for Apple Silicon using Metal.
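The 59% figure can be roughly reproduced with a short calculation. The sketch below assumes K8V4 maps to llama.cpp's block-quantized formats, where each block of 32 elements carries one 16-bit scale (so 8-bit storage costs 8.5 bits per element and 4-bit storage costs 4.5). The function names are illustrative and are not part of KVSplit.

```python
# Assumption: block-quantized storage with 32 elements per block and one
# 16-bit scale per block, as in llama.cpp's 8-bit and 4-bit block formats.
BLOCK_SIZE = 32
SCALE_BITS = 16
FP16_BITS = 16.0


def bits_per_element(quant_bits: int) -> float:
    """Effective bits per element, including the per-block scale."""
    return (BLOCK_SIZE * quant_bits + SCALE_BITS) / BLOCK_SIZE


def kv_reduction(key_bits: int, value_bits: int) -> float:
    """Fractional KV cache saving relative to FP16 keys and values."""
    quantized = bits_per_element(key_bits) + bits_per_element(value_bits)
    baseline = 2 * FP16_BITS  # one FP16 key plus one FP16 value
    return 1.0 - quantized / baseline


if __name__ == "__main__":
    print(f"K8V4 saving vs FP16: {kv_reduction(8, 4):.1%}")
    print(f"K4V8 saving vs FP16: {kv_reduction(4, 8):.1%}")
```

Under these assumptions K8V4 and K4V8 use the same amount of memory, so the choice between them rests on quality alone: the higher precision goes to the keys because they are the more sensitive half of the cache.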
Quick Start & Requirements
scripts/install_kvsplit.sh
llama.cpp integration options.
Highlighted Details
Maintenance & Community
The project is actively maintained by dipampaul17. Contributions are welcome via issues or pull requests.
Licensing & Compatibility
Limitations & Caveats
The project is specifically optimized for Apple Silicon Macs and may not offer the same benefits or performance on other architectures. Actual memory savings may vary slightly due to 256-byte page alignment in llama.cpp.
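To see why alignment shifts the measured savings, here is a minimal sketch that rounds a buffer size up to the next 256-byte boundary. The buffer sizes are hypothetical, and exactly where llama.cpp applies the alignment is an assumption made for illustration.

```python
def align_up(n_bytes: int, alignment: int = 256) -> int:
    """Round n_bytes up to the next multiple of alignment."""
    return (n_bytes + alignment - 1) // alignment * alignment


def saving(baseline_bytes: int, quantized_bytes: int) -> float:
    """Fractional saving after both buffers are padded to the alignment."""
    return 1.0 - align_up(quantized_bytes) / align_up(baseline_bytes)


if __name__ == "__main__":
    # Hypothetical small cache: 4096 bytes at FP16, 1664 bytes quantized
    # (13/32 of the FP16 size, i.e. a nominal 59.4% saving).
    fp16_bytes, quant_bytes = 4096, 1664
    print(align_up(quant_bytes))                    # padded quantized size
    print(f"{saving(fp16_bytes, quant_bytes):.1%}")  # saving after padding
```

The padding is at most 255 bytes per aligned buffer, so the effect shrinks as the cache grows; it is most visible at short context lengths.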