turboquant_plus by TheTom

LLM KV cache compression for efficient local inference

Created 6 days ago

616 stars

Top 53.4% on SourcePulse

Project Summary

TurboQuant+ addresses the significant memory overhead of LLM KV caches for local inference. It offers advanced compression techniques, enabling larger models and longer contexts on consumer hardware with minimal performance degradation. The project targets engineers and researchers seeking efficient LLM deployment.

How It Works

The core approach compresses transformer KV caches in two stages: PolarQuant (b-1 bits), which applies a random rotation followed by scalar quantization, and QJL (1 bit), which adds an unbiased inner-product correction. The result is a CompressedVector representation. This method achieves compression ratios of up to 4.6x while maintaining near-zero speed penalty and high fidelity relative to uncompressed or standard quantized caches.
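The two-stage pipeline can be sketched in NumPy roughly as follows. This is an illustrative toy, not the project's actual PolarQuant/QJL implementation: all function names are invented, the quantizer is a plain symmetric scalar quantizer, and the 1-bit stage is a generic sign (SimHash-style) sketch standing in for QJL.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d, rng):
    # Haar-random orthogonal matrix via QR of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def compress(v, rot, bits=3):
    # Stage 1 (PolarQuant-like, hypothetical): rotate to spread outliers,
    # then symmetric scalar quantization of the rotated coordinates.
    x = rot @ v
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    codes = np.round(x / scale).astype(np.int8)
    # Stage 2 (QJL-like, hypothetical): 1-bit signs of the rotated
    # coordinates; with a shared random rotation these act as a sign
    # sketch usable for inner-product/angle estimation.
    signs = (x >= 0).astype(np.uint8)
    return codes, scale, signs

def decompress(codes, scale, rot):
    return rot.T @ (codes.astype(np.float64) * scale)

def sign_cosine(signs_u, signs_v):
    # SimHash estimator: P(signs agree) = 1 - theta / pi.
    agree = np.mean(signs_u == signs_v)
    return np.cos(np.pi * (1.0 - agree))

d = 64
rot = random_rotation(d, rng)
u = rng.standard_normal(d)
w = u + 0.3 * rng.standard_normal(d)   # a vector correlated with u

codes, scale, signs_u = compress(u, rot)
u_hat = decompress(codes, scale, rot)
_, _, signs_w = compress(w, rot)
cos_est = sign_cosine(signs_u, signs_w)
```

The real implementation quantizes in polar form and corrects inner products without full decompression; this sketch only shows the rotate-quantize-sketch shape of the idea.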

Quick Start & Requirements

  • Python Prototype: Clone the repo, create a virtual environment, and install with pip install -e ".[dev]". Run tests with python3 -m pytest tests/ -v.
  • llama.cpp Integration: Clone the llama-cpp-turboquant fork. Build with Metal (Apple Silicon) or CUDA (NVIDIA) using CMake.
    • Metal: cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
    • CUDA: cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j (untested)
  • Prerequisites: Python >= 3.10, NumPy >= 1.24, SciPy >= 1.10, CMake, C/C++ compiler, Xcode Command Line Tools (for Metal). Optional: torch, transformers, accelerate.
  • Docs/Demos: benchmarks/demo.py, benchmarks/validate_real_model.py.
  • Papers: TurboQuant (arXiv 2504.19874), PolarQuant (arXiv 2502.02617), QJL (arXiv 2406.03482).
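Put together, the Python-prototype steps above look roughly like the following. The clone URL is hypothetical (the listing does not give one); the install, test, and demo commands are the ones stated above.

```shell
# Hypothetical clone URL -- substitute the project's actual repository.
git clone https://github.com/TheTom/turboquant_plus.git
cd turboquant_plus

# Isolated environment, then an editable install with dev extras.
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Verify the install, then try the demo script.
python3 -m pytest tests/ -v
python3 benchmarks/demo.py
```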

Highlighted Details

  • Achieves 4.6x KV cache compression (turbo3 type) with q8_0 speed parity on Apple Silicon (M5 Max: 2747 vs 2694 tok/s prefill).
  • Quality metrics show minimal degradation: perplexity is on par with q8_0 (-1.17% on CUDA, +1.1% on Metal).
  • Features like 4-mag LUT (+38-45% decode at long context) and Sparse V dequantization (+22.8% decode at 32K) enhance performance.
  • Needle-In-A-Haystack (NIAH) retrieval remains comparable to q8_0 (80-100% through 32K).
  • Rotation Gaussianization validated on real Qwen3 KV tensors, reducing kurtosis from 900 to 2.9.
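The Gaussianization effect in the last bullet can be reproduced in spirit on synthetic data. This uses made-up heavy-tailed values, not real Qwen3 tensors, and only illustrates the general fact that a random rotation spreads a few large outliers across all coordinates, collapsing kurtosis toward the Gaussian value of 3.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
d = 1024

# Synthetic heavy-tailed vector: Gaussian bulk plus a few large outliers,
# mimicking high-kurtosis KV activations (illustrative values only).
x = rng.standard_normal(d)
x[:6] += 100.0                       # inject outliers -> very high kurtosis

# Haar-random orthogonal rotation via QR of a Gaussian matrix.
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
y = q @ x                            # rotation preserves the norm exactly

print(f"kurtosis before: {kurtosis(x, fisher=False):.1f}")
print(f"kurtosis after:  {kurtosis(y, fisher=False):.1f}")
```

Because the rotation is orthogonal it is exactly invertible and loses no information; only the coordinate distribution changes, which is what makes the subsequent scalar quantization well behaved.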

Maintenance & Community

The project is v1 complete, speed-optimized, and community-tested with 511+ Python tests and 100% code coverage. A C port is integrated into a llama.cpp fork with Metal GPU kernels. Over 10 testers have contributed across Mac and NVIDIA hardware. The roadmap indicates ongoing work on CUDA backend, benchmark hardening, and advanced extensions.

Licensing & Compatibility

Licensed under the Apache License 2.0, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

The CUDA backend is still under development. Advanced "Plus" extensions like adaptive bit allocation and temporal decay are in experimental branches or planned. Integration requires using a specific llama.cpp fork, and upstream coordination is ongoing. The turbo4 variant is noted as broken and requires updates.

Health Check

  • Last Commit: 16 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 44
  • Star History: 1,119 stars in the last 6 days
