KVSplit by dipampaul17

CLI tool for differentiated KV cache quantization on Apple Silicon

  • Created: 2 months ago
  • Stars: 356
  • Rank: top 79.5% on sourcepulse

Project Summary

KVSplit enables running larger LLMs with longer context windows on Apple Silicon by applying differentiated precision to the KV cache. It targets users with M1/M2/M3 Macs who want to maximize LLM performance and context length within memory constraints, offering significant memory savings with minimal quality loss.

How It Works

KVSplit builds on the observation that keys in the KV cache are more sensitive to quantization than values. Storing keys at 8-bit and values at 4-bit precision (K8V4) cuts KV cache memory by 59% with under 1% perplexity degradation relative to FP16, and the project is optimized for Apple Silicon using Metal.
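How the asymmetric split is invoked depends on the build. As an illustrative sketch using upstream llama.cpp's cache-type flags (KVSplit's patched binaries may expose different options, and the binary and model paths below are placeholders):

    # 8-bit keys, 4-bit values (K8V4); -fa enables flash attention, which
    # upstream llama.cpp typically requires for a quantized V cache.
    ./build/bin/llama-cli -m models/model.gguf -c 8192 \
        -ctk q8_0 -ctv q4_0 -fa -p "Hello"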

Quick Start & Requirements

  • Install: Clone the repository and run scripts/install_kvsplit.sh, as shown below.
  • Prerequisites: macOS on Apple Silicon, Homebrew, Xcode Command Line Tools.
  • Setup: The installer offers a choice of Python environment and llama.cpp integration options.
  • Docs: https://github.com/dipampaul17/KVSplit
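A minimal install sketch based on the steps above (the clone URL and script path come from this page; the installer's interactive prompts cover the Python and llama.cpp choices):

    git clone https://github.com/dipampaul17/KVSplit.git
    cd KVSplit
    ./scripts/install_kvsplit.sh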

Highlighted Details

  • Cuts KV cache memory by up to 72% (K4V4) or 59% (K8V4) with minimal quality loss; see the arithmetic sketch below.
  • The K8V4 configuration also delivers a 5.7% inference speedup over FP16.
  • Includes comprehensive benchmarking tools for memory, speed, and quality.
  • Provides publication-quality visualization scripts for results.
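The headline reductions are consistent with a back-of-envelope estimate, assuming the cache uses llama.cpp's q8_0/q4_0 block formats (32 elements plus a 2-byte FP16 scale per block); the gap from the naive 62.5% figure is that per-block overhead:

    FP16 : 2 B (K) + 2 B (V)        = 4.0    B per element pair
    q8_0 : (32 + 2) B / 32 elems    = 1.0625 B/elem (keys)
    q4_0 : (16 + 2) B / 32 elems    = 0.5625 B/elem (values)
    K8V4 : 1.0625 + 0.5625 = 1.625  ->  1 - 1.625/4.0 = 59.4% saved
    K4V4 : 2 * 0.5625      = 1.125  ->  1 - 1.125/4.0 = 71.9% saved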

Maintenance & Community

The project is maintained by dipampaul17, though recent activity is light (see Health Check below). Contributions are welcome via issues or pull requests.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: The permissive license allows commercial use and integration with closed-source applications.

Limitations & Caveats

The project is specifically optimized for Apple Silicon Macs and may not offer the same benefits or performance on other architectures. Actual memory savings may vary slightly due to 256B page alignment in llama.cpp.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 357 stars in the last 90 days
