rotorquant by scrya-com

LLM KV cache compression via block-diagonal rotation

Created 2 weeks ago

550 stars

Top 58.0% on SourcePulse

Project Summary

RotorQuant provides KV cache compression for LLMs, aimed at engineers and researchers who want to improve inference efficiency. It uses block-diagonal rotation techniques to deliver faster decode and prefill and lower memory usage, with quality competitive with methods such as TurboQuant, and it integrates directly into llama.cpp.

How It Works

The core approach replaces dense transforms with parallelizable block-diagonal rotations (2D Givens for PlanarQuant, 4D quaternions for IsoQuant), exploiting KV cache vector sparsity. This O(d) method decorrelates vectors before scalar quantization. "Deferred quantization" maintains FP16 during prefill, quantizing only during decode, minimizing error and dequantization overhead in optimized attention kernels.
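As a rough illustration (not the repo's actual kernels), here is what a block-diagonal 2D Givens rotation looks like in NumPy: each pair of coordinates is rotated independently, so the transform is O(d), orthogonal (norm-preserving), and invertible simply by negating the angles. The 3-bit quantizer below is a naive stand-in for whatever scalar quantization scheme the optimized kernels actually use:

```python
import numpy as np

def planar_rotate(x, angles):
    """Block-diagonal 2D Givens rotation: each pair (x[2i], x[2i+1])
    is rotated by angles[i] independently -- an O(d) orthogonal transform."""
    y = x.copy()
    c, s = np.cos(angles), np.sin(angles)
    y[0::2] = c * x[0::2] - s * x[1::2]
    y[1::2] = s * x[0::2] + c * x[1::2]
    return y

def quantize_3bit(x):
    """Naive symmetric 3-bit scalar quantization (illustrative only)."""
    scale = np.abs(x).max() / 3.0
    q = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d).astype(np.float32)
angles = rng.uniform(0, 2 * np.pi, d // 2)

y = planar_rotate(x, angles)            # decorrelating rotation
q, scale = quantize_3bit(y)             # scalar-quantize the rotated vector
x_rec = planar_rotate(q * scale, -angles)  # inverse rotation = negated angles
```

Because the rotation is orthogonal, quantization error introduced in the rotated basis carries over unchanged (in 2-norm) to the reconstructed vector, which is what makes "decorrelate, then scalar-quantize" well-behaved.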

Quick Start & Requirements

Integration via the llama.cpp fork (feature/planarquant-kv-cache) is recommended.

  • Install: Clone the repository (git clone https://github.com/johndpree/llama-cpp-turboquant.git), checkout the specific branch (git checkout feature/planarquant-kv-cache), and build using CMake, enabling CUDA (-DGGML_CUDA=ON) or Metal (-DGGML_METAL=ON) as needed.
  • Prerequisites: CUDA toolkit for NVIDIA GPUs, or Metal for Apple Silicon. Python and the datasets library are required for perplexity benchmarks.
  • Usage: Run llama-server or llama-bench with specific --cache-type-k and --cache-type-v flags (e.g., iso3, planar3).
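The steps above can be collapsed into one script. The repo URL, branch name, CMake flags, and cache-type values are from the notes above; the model path and build directory are illustrative placeholders:

```shell
# Build the llama.cpp fork with the PlanarQuant/IsoQuant KV-cache branch
git clone https://github.com/johndpree/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/planarquant-kv-cache
cmake -B build -DGGML_CUDA=ON    # or -DGGML_METAL=ON on Apple Silicon
cmake --build build --config Release

# Run with a compressed KV cache (model path is a placeholder)
./build/bin/llama-server -m model.gguf \
  --cache-type-k iso3 --cache-type-v iso3
```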

Highlighted Details

  • Performance: Outperforms TurboQuant: better PPL (6.91 vs 7.07), 28% faster decode, 5.3x faster prefill, and uses 44x fewer parameters.
  • Compression: Achieves up to 10.3x compression (3-bit symmetric) and 5.1x compression (K-only).
  • VRAM Savings: Demonstrates significant memory reduction, e.g., saving 260MB at 8K context with 3-bit symmetric compression.
  • Architecture: Evolved from complex Clifford rotors (RotorQuant) to efficient IsoQuant (4D quaternion) and PlanarQuant (2D Givens) block rotations.
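To see where figures like the 8K-context VRAM savings come from, a back-of-the-envelope KV cache size formula helps. The model dimensions below are hypothetical (7B-class) and per-block scale overhead is ignored, so these numbers are illustrative, not the repo's measurements:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bits_per_value):
    # K and V each store n_layers * n_kv_heads * head_dim values per token
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return values_per_token * ctx_len * bits_per_value / 8

# Hypothetical 7B-class dims with grouped-query attention (illustrative):
fp16 = kv_cache_bytes(32, 8, 128, 8192, 16)
q3 = kv_cache_bytes(32, 8, 128, 8192, 3)  # ignores scale overhead
print(f"FP16: {fp16 / 2**20:.0f} MiB, 3-bit: {q3 / 2**20:.0f} MiB")
# FP16: 1024 MiB, 3-bit: 192 MiB
```

The raw 16-bit to 3-bit ratio is 16/3 ≈ 5.3x; quoted ratios above that (e.g. 10.3x) or below it depend on metadata overhead and whether both K and V caches are compressed.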

Maintenance & Community

Recent commits (April 1, 2026) indicate active development on the llama.cpp integration branch. Key design contributions for IsoQuant and PlanarQuant are attributed to @ParaMind2025. No specific community channels (like Discord or Slack) or formal roadmap links are provided.

Licensing & Compatibility

The project is released under the MIT License. This permissive license allows for broad compatibility, including use in commercial and closed-source applications without significant restrictions.

Limitations & Caveats

The original RotorQuant implementation is research-grade and uses Triton. Production-ready IsoQuant and PlanarQuant backends are integrated into llama.cpp. While quality is competitive, perplexity scores for compressed caches are slightly higher than the FP16 baseline. Performance characteristics vary based on the specific combination of K and V cache quantization types used.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 5
  • Star History: 552 stars in the last 16 days
