turboquant by 0xSero

LLM inference accelerated via KV cache compression

Created 2 weeks ago

933 stars

Top 39.0% on SourcePulse

Project Summary

This project addresses the significant memory overhead of Key-Value (KV) caches in Large Language Model (LLM) inference. It provides an implementation of TurboQuant, a near-optimal KV cache quantization technique (3-bit keys, 2-bit values), integrated with vLLM and optimized using Triton kernels. This enables LLM inference to handle significantly longer contexts and increases token capacity, benefiting researchers and engineers focused on optimizing LLM deployment and performance.

How It Works

TurboQuant employs a multi-stage compression strategy: random orthogonal rotation spreads information across dimensions, followed by Lloyd-Max optimal scalar quantization on Beta-distributed values. A QJL projection handles residual sign bits, and group quantization is applied to values with per-group scales. Finally, bit-packing efficiently stores the quantized data. This approach aims to minimize quantization error, particularly for keys, while drastically reducing KV cache memory footprint, leading to substantial gains in maximum token capacity and inference throughput.
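The value-side path of this pipeline can be sketched in NumPy: a random orthogonal rotation, per-group scalar quantization with per-group scales, and 2-bit packing. This is an illustrative simplification, not the repo's implementation: it substitutes simple min-max scales for the Lloyd-Max codebooks and omits the QJL sign-bit residual used for keys; all names below are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR; spreads information across dimensions
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def group_quantize(x, bits=2, group=32):
    # Per-group min-max scalar quantization (stand-in for Lloyd-Max)
    x = x.reshape(-1, group)
    lo = x.min(axis=1, keepdims=True)
    scale = (x.max(axis=1, keepdims=True) - lo) / (2**bits - 1)
    scale[scale == 0] = 1.0  # guard against flat groups
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

d, n = 128, 64
v = rng.standard_normal((n, d)).astype(np.float32)

R = random_rotation(d)
codes, scale, lo = group_quantize(v @ R, bits=2)

# Bit-pack four 2-bit codes into each byte
packed = (codes[:, 0::4] | (codes[:, 1::4] << 2)
          | (codes[:, 2::4] << 4) | (codes[:, 3::4] << 6))

# Reconstruct: dequantize, then undo the rotation
v_hat = (codes * scale + lo).reshape(n, d) @ R.T
cos = float((v * v_hat).sum() / (np.linalg.norm(v) * np.linalg.norm(v_hat)))
print(f"2-bit value cosine similarity: {cos:.3f}")
```

Raising `bits` to 4 in the same sketch recovers most of the lost similarity, mirroring the 2-bit vs 4-bit trade-off the repo reports.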

Quick Start & Requirements

  • Install: pip install -e .
  • Prerequisites: vLLM (0.18.0), PyTorch (2.10), CUDA (12.8), Python 3.12. Tested on NVIDIA RTX 3090 (24GB) and RTX 5090 (32GB) GPUs.
  • Resource Footprint: Requires significant GPU memory for LLM weights and KV cache, especially for large models and long contexts. Running benchmarks like proof.py necessitates multi-GPU setups (e.g., 4x RTX 3090).
  • Links: Paper (arXiv:2504.19874), Validation scripts (validate_paper.py, audit_claims.py).

Highlighted Details

  • Performance: Achieved +5.7% Prefill tok/s and +3.1% Decode tok/s on an RTX 5090.
  • Memory Savings: Demonstrated freeing 30.0 GB of KV cache across 4 GPUs and enabling 2.0x maximum token capacity.
  • Quality: Key compression is near-lossless (cos_sim=1.000). 2-bit value quantization (cos_sim=0.940) is the primary quality bottleneck; 4-bit values (cos_sim=0.997) are recommended for sensitive applications.
  • Validation: All 35 internal tests pass, covering modular architecture, core quantizer logic, and paper theorem validation.
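The capacity gain follows from simple bit accounting: 3-bit keys plus 2-bit values replace two 16-bit tensors. A back-of-envelope sizing, using a hypothetical model shape (not taken from the repo) and ignoring per-group scale and sign-bit overhead, which the real implementation also stores:

```python
# Back-of-envelope KV cache sizing (illustrative only)
def kv_cache_gib(tokens, layers, kv_heads, head_dim, key_bits, value_bits):
    bits = tokens * layers * kv_heads * head_dim * (key_bits + value_bits)
    return bits / 8 / 2**30

# Hypothetical shape: 128k tokens, 32 layers, 8 KV heads, head_dim 128
fp16 = kv_cache_gib(128_000, 32, 8, 128, 16, 16)
tq   = kv_cache_gib(128_000, 32, 8, 128, 3, 2)
print(f"fp16: {fp16:.1f} GiB, TurboQuant: {tq:.1f} GiB, ratio: {fp16/tq:.1f}x")
```

The raw 32/(3+2) = 6.4x ratio applies only to full-attention layers; observed end-to-end capacity gains (e.g. the 2.0x above) are smaller because weights and uncompressed layers still consume memory.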

Limitations & Caveats

Prefill operations still use the standard paged KV cache allocation. TurboQuant compresses KV cache entries only for full-attention layers, leaving linear-attention and Mamba layers uncompressed. The 2-bit value quantization degrades quality (cos_sim=0.940), so 4-bit values are recommended where fidelity matters. The hybrid decode mechanism currently dequantizes the entire compressed history to float32 at each step, and gains shrink on Mixture-of-Experts (MoE) models with substantial linear layers. An adversarial audit also flagged several claims from the associated paper as potentially misleading or based on trivial test cases.
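Because hybrid decode dequantizes the full history at every step, per-step cost grows with context length and total decode work scales quadratically in generated tokens. A toy tally with illustrative numbers (not measurements from the repo):

```python
# Toy cost model: one dequantization op per cached element per decode step
def total_dequant_ops(prompt_len, new_tokens):
    ops = 0
    history = prompt_len
    for _ in range(new_tokens):
        ops += history   # dequantize the entire compressed history this step
        history += 1     # the newly generated token joins the cache
    return ops

print(total_dequant_ops(1000, 100))
```

A cache-aware decode path that keeps dequantized blocks resident would make this linear per step, which is presumably why the current per-step behavior is listed as a caveat.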

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 7
  • Star History: 945 stars in the last 17 days
