RotorQuant: LLM KV cache compression via block-diagonal rotation
Top 58.0% on SourcePulse
Summary
RotorQuant offers advanced KV cache compression for LLMs, targeting engineers and researchers who want to boost inference efficiency. It employs block-diagonal rotation techniques, delivering substantial gains in decode/prefill speed and reduced memory usage, with quality competitive with methods such as TurboQuant and seamless integration into llama.cpp.
How It Works
The core approach replaces dense transforms with parallelizable block-diagonal rotations (2D Givens for PlanarQuant, 4D quaternions for IsoQuant), exploiting KV cache vector sparsity. This O(d) method decorrelates vectors before scalar quantization. "Deferred quantization" maintains FP16 during prefill, quantizing only during decode, minimizing error and dequantization overhead in optimized attention kernels.
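The rotate-then-quantize pipeline can be illustrated with a minimal NumPy sketch. This is not the project's code: the covariance-based angle fitting, the 3-bit symmetric quantizer, and the toy data are illustrative assumptions; the actual PlanarQuant parameterization and kernels may differ. What it does show is the O(d) block-diagonal structure, where each pair of dimensions gets an independent 2D Givens rotation that decorrelates the pair before scalar quantization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 1024  # head dimension, number of cached KV vectors (toy sizes)

# Toy "KV cache": adjacent dimensions are strongly correlated.
base = rng.normal(size=(n, d // 2))
X = np.empty((n, d))
X[:, 0::2] = base
X[:, 1::2] = 0.9 * base + 0.1 * rng.normal(size=(n, d // 2))

def givens_angles(X):
    """Per-pair angles that decorrelate dims (2i, 2i+1).

    For a 2x2 covariance block [[a, b], [b, c]], rotating by
    theta = 0.5 * atan2(-2b, a - c) zeroes the off-diagonal term.
    """
    a = np.var(X[:, 0::2], axis=0)
    c = np.var(X[:, 1::2], axis=0)
    b = np.mean((X[:, 0::2] - X[:, 0::2].mean(0)) *
                (X[:, 1::2] - X[:, 1::2].mean(0)), axis=0)
    return 0.5 * np.arctan2(-2.0 * b, a - c)

def block_rotate(X, theta, inverse=False):
    """Apply the block-diagonal 2D Givens rotation: O(d) work per vector."""
    sgn = -1.0 if inverse else 1.0
    c, s = np.cos(theta), np.sin(sgn * theta)
    x0, x1 = X[:, 0::2], X[:, 1::2]
    Y = np.empty_like(X)
    Y[:, 0::2] = c * x0 - s * x1
    Y[:, 1::2] = s * x0 + c * x1
    return Y

def quantize(X, bits=3):
    """Symmetric per-vector scalar quantization, returning dequantized values."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(X).max(axis=1, keepdims=True) / levels
    return np.round(X / scale) * scale

theta = givens_angles(X)
rotated = block_rotate(X, theta)

# End-to-end error: quantize directly vs. rotate -> quantize -> rotate back.
err_plain = np.mean((X - quantize(X)) ** 2)
err_rot = np.mean((X - block_rotate(quantize(rotated), theta, inverse=True)) ** 2)
print(f"MSE plain={err_plain:.2e} rotated={err_rot:.2e}")
```

Because each 2x2 block is independent, all pairs rotate in parallel with constant work per dimension, which is what makes the transform cheap enough to sit inline in an attention kernel.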
Quick Start & Requirements
Integration via the llama.cpp fork (feature/planarquant-kv-cache) is recommended.
Clone the fork (git clone https://github.com/johndpree/llama-cpp-turboquant.git), check out the integration branch (git checkout feature/planarquant-kv-cache), and build with CMake, enabling CUDA (-DGGML_CUDA=ON) or Metal (-DGGML_METAL=ON) as needed. The Python datasets library is required for the perplexity benchmarks. Run llama-server or llama-bench with the appropriate --cache-type-k and --cache-type-v flags (e.g., iso3, planar3). The original research implementation installs with pip install -e . && pip install triton.
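Assuming the fork follows standard llama.cpp CMake conventions, the quick-start steps can be sketched as follows; the model path is a placeholder, and the exact binary locations may vary by platform and build configuration.

```shell
# Clone the fork and switch to the integration branch (URL and branch from the text)
git clone https://github.com/johndpree/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/planarquant-kv-cache

# Build with CMake, enabling the backend for your hardware
cmake -B build -DGGML_CUDA=ON        # or: -DGGML_METAL=ON on Apple hardware
cmake --build build --config Release -j

# Benchmark with a compressed KV cache (quant type names from the text)
./build/bin/llama-bench -m model.gguf \
    --cache-type-k planar3 --cache-type-v planar3
```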
Maintenance & Community
Recent commits (April 1, 2026) indicate active development on the llama.cpp integration branch. Key design contributions for IsoQuant and PlanarQuant are attributed to @ParaMind2025. No specific community channels (like Discord or Slack) or formal roadmap links are provided.
Licensing & Compatibility
The project is released under the MIT License. This permissive license allows for broad compatibility, including use in commercial and closed-source applications without significant restrictions.
Limitations & Caveats
The original RotorQuant implementation is research-grade and uses Triton. Production-ready IsoQuant and PlanarQuant backends are integrated into llama.cpp. While quality is competitive, perplexity scores for compressed caches are slightly higher than the FP16 baseline. Performance characteristics vary based on the specific combination of K and V cache quantization types used.