turboquant by 0xSero

LLM inference accelerated via KV cache compression

Created 2 weeks ago

933 stars

Top 39.0% on SourcePulse

Project Summary

This project addresses the significant memory overhead of Key-Value (KV) caches in Large Language Model (LLM) inference. It provides an implementation of TurboQuant, a near-optimal KV cache quantization technique (3-bit keys, 2-bit values), integrated with vLLM and optimized using Triton kernels. This enables LLM inference to handle significantly longer contexts and increases token capacity, benefiting researchers and engineers focused on optimizing LLM deployment and performance.

How It Works

TurboQuant employs a multi-stage compression strategy: random orthogonal rotation spreads information across dimensions, followed by Lloyd-Max optimal scalar quantization on Beta-distributed values. A QJL projection handles residual sign bits, and group quantization is applied to values with per-group scales. Finally, bit-packing efficiently stores the quantized data. This approach aims to minimize quantization error, particularly for keys, while drastically reducing KV cache memory footprint, leading to substantial gains in maximum token capacity and inference throughput.
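The value-side path of this pipeline can be sketched in NumPy: a random orthogonal rotation, per-group scalar quantization with per-group scales, and 2-bit packing. This is an illustrative simplification, not the repo's implementation: it substitutes simple min-max scales for the Lloyd-Max codebooks and omits the QJL sign-bit residual used for keys; all names below are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR; spreads information across dimensions
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def group_quantize(x, bits=2, group=32):
    # Per-group min-max scalar quantization (stand-in for Lloyd-Max)
    x = x.reshape(-1, group)
    lo = x.min(axis=1, keepdims=True)
    scale = (x.max(axis=1, keepdims=True) - lo) / (2**bits - 1)
    scale[scale == 0] = 1.0  # guard against flat groups
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

d, n = 128, 64
v = rng.standard_normal((n, d)).astype(np.float32)

R = random_rotation(d)
codes, scale, lo = group_quantize(v @ R, bits=2)

# Bit-pack four 2-bit codes into each byte
packed = (codes[:, 0::4] | (codes[:, 1::4] << 2)
          | (codes[:, 2::4] << 4) | (codes[:, 3::4] << 6))

# Reconstruct: dequantize, then undo the rotation
v_hat = (codes * scale + lo).reshape(n, d) @ R.T
cos = float((v * v_hat).sum() / (np.linalg.norm(v) * np.linalg.norm(v_hat)))
print(f"2-bit value cosine similarity: {cos:.3f}")
```

Raising `bits` to 4 in the same sketch recovers most of the lost similarity, mirroring the 2-bit vs 4-bit trade-off the repo reports.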

Quick Start & Requirements

  • Install: pip install -e .
  • Prerequisites: vLLM (0.18.0), PyTorch (2.10), CUDA (12.8), Python 3.12. Tested on NVIDIA RTX 3090 (24GB) and RTX 5090 (32GB) GPUs.
  • Resource Footprint: Requires significant GPU memory for LLM weights and KV cache, especially for large models and long contexts. Running benchmarks like proof.py necessitates multi-GPU setups (e.g., 4x RTX 3090).
  • Links: Paper (arXiv:2504.19874), Validation scripts (validate_paper.py, audit_claims.py).

Highlighted Details

  • Performance: Achieved +5.7% Prefill tok/s and +3.1% Decode tok/s on an RTX 5090.
  • Memory Savings: Demonstrated freeing 30.0 GB of KV cache across 4 GPUs and enabling 2.0x maximum token capacity.
  • Quality: Key compression is near-lossless (cos_sim=1.000). 2-bit value quantization (cos_sim=0.940) is the primary quality bottleneck; 4-bit values (cos_sim=0.997) are recommended for sensitive applications.
  • Validation: All 35 internal tests pass, covering modular architecture, core quantizer logic, and paper theorem validation.
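The capacity gain follows from simple bit accounting: 3-bit keys plus 2-bit values replace two 16-bit tensors. A back-of-envelope sizing, using a hypothetical model shape (not taken from the repo) and ignoring per-group scale and sign-bit overhead, which the real implementation also stores:

```python
# Back-of-envelope KV cache sizing (illustrative only)
def kv_cache_gib(tokens, layers, kv_heads, head_dim, key_bits, value_bits):
    bits = tokens * layers * kv_heads * head_dim * (key_bits + value_bits)
    return bits / 8 / 2**30

# Hypothetical shape: 128k tokens, 32 layers, 8 KV heads, head_dim 128
fp16 = kv_cache_gib(128_000, 32, 8, 128, 16, 16)
tq   = kv_cache_gib(128_000, 32, 8, 128, 3, 2)
print(f"fp16: {fp16:.1f} GiB, TurboQuant: {tq:.1f} GiB, ratio: {fp16/tq:.1f}x")
```

The raw 32/(3+2) = 6.4x ratio applies only to full-attention layers; observed end-to-end capacity gains (e.g. the 2.0x above) are smaller because weights and uncompressed layers still consume memory.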

Limitations & Caveats

Prefill operations still use the standard paged KV cache allocation. TurboQuant compresses KV cache entries only for full-attention layers, leaving linear-attention and Mamba layers uncompressed. The 2-bit value quantization degrades quality (cos_sim=0.940), so 4-bit values are recommended where fidelity matters. The hybrid decode mechanism currently dequantizes the entire compressed history to float32 at each step, and gains shrink on Mixture-of-Experts (MoE) models with substantial linear layers. An adversarial audit also flagged several claims from the associated paper as potentially misleading or based on trivial test cases.
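Because hybrid decode dequantizes the full history at every step, per-step cost grows with context length and total decode work scales quadratically in generated tokens. A toy tally with illustrative numbers (not measurements from the repo):

```python
# Toy cost model: one dequantization op per cached element per decode step
def total_dequant_ops(prompt_len, new_tokens):
    ops = 0
    history = prompt_len
    for _ in range(new_tokens):
        ops += history   # dequantize the entire compressed history this step
        history += 1     # the newly generated token joins the cache
    return ops

print(total_dequant_ops(1000, 100))
```

A cache-aware decode path that keeps dequantized blocks resident would make this linear per step, which is presumably why the current per-step behavior is listed as a caveat.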

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 7
  • Star History: 945 stars in the last 17 days
