KVSplit by dipampaul17

CLI tool for differentiated KV cache quantization on Apple Silicon

  • Created: 2 months ago
  • Stars: 356
  • Rank: top 79.5% on sourcepulse

Project Summary

KVSplit enables running larger LLMs with longer context windows on Apple Silicon by applying differentiated precision to the KV cache. It targets users with M1/M2/M3 Macs who want to maximize LLM performance and context length within memory constraints, offering significant memory savings with minimal quality loss.

How It Works

KVSplit builds on the observation that keys in the KV cache are more sensitive to quantization than values. Storing keys at 8-bit and values at 4-bit precision (K8V4) cuts KV cache memory by 59% with under 1% perplexity degradation relative to FP16, and the project is optimized for Apple Silicon using Metal.
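How the asymmetric split is invoked depends on the build. As an illustrative sketch using upstream llama.cpp's cache-type flags (KVSplit's patched binaries may expose different options, and the binary and model paths below are placeholders):

    # 8-bit keys, 4-bit values (K8V4); -fa enables flash attention, which
    # upstream llama.cpp typically requires for a quantized V cache.
    ./build/bin/llama-cli -m models/model.gguf -c 8192 \
        -ctk q8_0 -ctv q4_0 -fa -p "Hello"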

Quick Start & Requirements

  • Install: Clone the repository and run scripts/install_kvsplit.sh, as shown below.
  • Prerequisites: macOS on Apple Silicon, Homebrew, Xcode Command Line Tools.
  • Setup: The installer offers a choice of Python environment and llama.cpp integration options.
  • Docs: https://github.com/dipampaul17/KVSplit
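A minimal install sketch based on the steps above (the clone URL and script path come from this page; the installer's interactive prompts cover the Python and llama.cpp choices):

    git clone https://github.com/dipampaul17/KVSplit.git
    cd KVSplit
    ./scripts/install_kvsplit.sh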

Highlighted Details

  • Cuts KV cache memory by up to 72% (K4V4) or 59% (K8V4) with minimal quality loss; see the arithmetic sketch below.
  • The K8V4 configuration also delivers a 5.7% inference speedup over FP16.
  • Includes comprehensive benchmarking tools for memory, speed, and quality.
  • Provides publication-quality visualization scripts for results.
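The headline reductions are consistent with a back-of-envelope estimate, assuming the cache uses llama.cpp's q8_0/q4_0 block formats (32 elements plus a 2-byte FP16 scale per block); the gap from the naive 62.5% figure is that per-block overhead:

    FP16 : 2 B (K) + 2 B (V)        = 4.0    B per element pair
    q8_0 : (32 + 2) B / 32 elems    = 1.0625 B/elem (keys)
    q4_0 : (16 + 2) B / 32 elems    = 0.5625 B/elem (values)
    K8V4 : 1.0625 + 0.5625 = 1.625  ->  1 - 1.625/4.0 = 59.4% saved
    K4V4 : 2 * 0.5625      = 1.125  ->  1 - 1.125/4.0 = 71.9% saved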

Maintenance & Community

The project is maintained by dipampaul17, though recent activity is light (see Health Check below). Contributions are welcome via issues or pull requests.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: The permissive license allows commercial use and integration with closed-source applications.

Limitations & Caveats

The project is specifically optimized for Apple Silicon Macs and may not offer the same benefits or performance on other architectures. Actual memory savings may vary slightly due to 256B page alignment in llama.cpp.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 357 stars in the last 90 days
