0xSero: LLM inference accelerated via KV cache compression
Top 39.0% on SourcePulse
This project addresses the significant memory overhead of Key-Value (KV) caches in Large Language Model (LLM) inference. It provides an implementation of TurboQuant, a near-optimal KV cache quantization technique (3-bit keys, 2-bit values), integrated with vLLM and optimized using Triton kernels. This enables LLM inference to handle significantly longer contexts and increases token capacity, benefiting researchers and engineers focused on optimizing LLM deployment and performance.
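The memory savings from 3-bit keys and 2-bit values can be estimated with simple arithmetic. A minimal sketch, assuming a hypothetical Llama-7B-like shape (32 layers, 32 KV heads, head dimension 128; these numbers are illustrative, not from the project) and ignoring the small overhead of per-group scales:

```python
# Back-of-envelope KV cache sizing. The model shape below is an assumed
# Llama-7B-like config, not a figure from the project; the overhead of
# per-group quantization scales is ignored.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   key_bits=16, value_bits=16):
    """Bytes needed to cache keys and values for one sequence."""
    elems_per_tensor = n_layers * n_kv_heads * head_dim * seq_len
    return elems_per_tensor * (key_bits + value_bits) // 8

ctx = 32_768
fp16 = kv_cache_bytes(ctx)                             # 16-bit keys and values
turbo = kv_cache_bytes(ctx, key_bits=3, value_bits=2)  # TurboQuant bit-widths
print(f"fp16:  {fp16 / 2**30:.1f} GiB")                # 16.0 GiB
print(f"quant: {turbo / 2**30:.2f} GiB ({fp16 / turbo:.1f}x smaller)")
```

At a 32K context this works out to roughly a 6.4x reduction (16-bit vs 2.5-bit average), which is where the extra token capacity comes from.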
How It Works
TurboQuant employs a multi-stage compression strategy: random orthogonal rotation spreads information across dimensions, followed by Lloyd-Max optimal scalar quantization on Beta-distributed values. A QJL projection handles residual sign bits, and group quantization is applied to values with per-group scales. Finally, bit-packing efficiently stores the quantized data. This approach aims to minimize quantization error, particularly for keys, while drastically reducing KV cache memory footprint, leading to substantial gains in maximum token capacity and inference throughput.
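The rotation-then-quantize stages above can be sketched in NumPy. This is an illustration only, not the project's Triton kernels: uniform absmax quantization stands in for the Lloyd-Max codebooks, and the QJL sign-bit projection and bit-packing stages are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    """Stage 1: a random orthogonal rotation spreads information across dims."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def quantize_groups(x, bits, group_size=64):
    """Stage 2: per-group scalar quantization with per-group scales.
    Uniform absmax quantization here, as a stand-in for Lloyd-Max."""
    g = x.reshape(-1, group_size)
    scales = np.abs(g).max(axis=1, keepdims=True) + 1e-8
    levels = 2 ** (bits - 1) - 1  # symmetric integer range per code
    codes = np.clip(np.round(g / scales * levels), -levels, levels).astype(np.int8)
    return codes, scales

def dequantize_groups(codes, scales, bits, shape):
    levels = 2 ** (bits - 1) - 1
    return (codes.astype(np.float32) / levels * scales).reshape(shape)

d, n = 128, 64
R = random_rotation(d)
keys = rng.standard_normal((n, d)).astype(np.float32)

rotated = keys @ R                                 # rotate
codes, scales = quantize_groups(rotated, bits=3)   # 3-bit keys
recon = dequantize_groups(codes, scales, 3, rotated.shape) @ R.T  # undo rotation

rel_err = np.linalg.norm(recon - keys) / np.linalg.norm(keys)
```

Because the rotation is orthogonal it is exactly invertible, so all reconstruction error comes from the scalar quantization step; the real system keeps that error lower with Lloyd-Max levels matched to the post-rotation value distribution.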
Quick Start & Requirements
Install in editable mode with pip install -e .. The proof script proof.py requires a multi-GPU setup (e.g., 4x RTX 3090). Validation scripts are included (validate_paper.py, audit_claims.py).
Limitations & Caveats
Prefill operations still utilize the standard paged KV cache allocation. TurboQuant only compresses KV cache entries for full-attention layers, leaving linear-attention and Mamba layers uncompressed. The 2-bit value quantization introduces a quality degradation (cos_sim=0.940), necessitating the use of 4-bit values for higher fidelity. The hybrid decode mechanism currently dequantizes all compressed history to float32 per step, and benefits are reduced on Mixture-of-Experts (MoE) models with substantial linear layers. An adversarial audit also flagged several claims from the associated paper as potentially misleading or based on trivial test cases.
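The decode-side cost of dequantizing the full compressed history every step grows linearly per token, and therefore quadratically over a whole generation. A rough cost model, again assuming a hypothetical 32-layer, 32-KV-head, head-dim-128 shape:

```python
# Rough cost model for the dequantize-everything-per-step decode path.
# The model shape is an assumption for illustration, not from the project.
def dequant_elems_per_step(history_len, n_layers=32, n_kv_heads=32, head_dim=128):
    """Elements dequantized at one decode step: the whole K and V history."""
    return 2 * n_layers * n_kv_heads * head_dim * history_len

def total_dequant_elems(gen_len, prompt_len=0, **kw):
    """Summed over a generation: quadratic in total sequence length."""
    return sum(dequant_elems_per_step(prompt_len + t, **kw)
               for t in range(1, gen_len + 1))

# Doubling the generation length roughly quadruples total dequantization work.
ratio = total_dequant_elems(2048) / total_dequant_elems(1024)
```

This is why the memory savings do not automatically translate into throughput gains at long generation lengths under the current hybrid decode mechanism.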
Last updated 2 weeks ago · Inactive