TurboQuant-GPU (DevTechJr): LLM inference acceleration via KV cache compression
TurboQuant-GPU is a library that improves LLM inference efficiency by compressing the Key-Value (KV) cache on NVIDIA GPUs. It targets researchers and engineers who need to reduce memory footprints and accelerate LLM inference, offering substantial compression ratios through advanced quantization techniques.
How It Works
Random orthogonal rotation transforms KV cache coordinates into an approximate Gaussian distribution. This enables optimal Lloyd-Max quantization, achieving high compression with minimal similarity loss (0.98 cosine similarity). Keys are quantized to 2 bits using MSE, augmented by 1-bit QJL bias correction. Values receive 3-bit MSE quantization. Both key and value compression are executed within a single, fused kernel launch per attention head. This approach exploits the post-rotation Gaussian structure inherent to KV caches, offering superior compression (5.02x) compared to general-purpose FP4 formats like MXFP4 and NVFP4.
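The rotate-then-quantize idea above can be sketched in a few lines of NumPy. This is an illustrative toy, not the library's kernel: the 2-bit codebook uses the standard Lloyd-Max levels for a unit Gaussian, the rotation comes from a QR decomposition, and the per-row scaling is an assumption made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d, rng):
    # Random orthogonal matrix via QR of a Gaussian matrix,
    # with column signs fixed so the rotation is uniformly distributed.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

# Lloyd-Max reconstruction levels for a standard Gaussian at 2 bits (4 levels).
LEVELS_2BIT = np.array([-1.510, -0.4528, 0.4528, 1.510])

def quantize(x, levels):
    # Nearest-level (MSE-optimal) assignment per coordinate.
    idx = np.abs(x[..., None] - levels).argmin(axis=-1)
    return levels[idx]

d = 128
keys = rng.laplace(size=(64, d))        # heavy-tailed stand-in for KV rows
Q = random_rotation(d, rng)
rotated = keys @ Q                      # coordinates become ~Gaussian
scale = rotated.std(axis=-1, keepdims=True)
deq = quantize(rotated / scale, LEVELS_2BIT)
recon = (deq * scale) @ Q.T             # dequantize and rotate back

cos = np.sum(keys * recon, axis=-1) / (
    np.linalg.norm(keys, axis=-1) * np.linalg.norm(recon, axis=-1)
)
print(round(float(cos.mean()), 3))
```

Because the rotation is orthogonal, quantization error in the rotated space maps directly to cosine-similarity loss in the original space; the 2-bit Gaussian quantizer alone keeps cosine similarity above roughly 0.9, and the library's additional 1-bit bias correction pushes it higher.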
Quick Start & Requirements
- `pip install turboquant-gpu`
- For cuTile acceleration: `pip install cuda-tile[tileiras] --extra-index-url https://pypi.nvidia.com` (requires a CUDA 13.0+ driver). If cuTile is unavailable or the driver is older, functionality falls back to PyTorch.
- A `quickstart.ipynb` notebook is available for installation and usage guidance.

Highlighted Details
The exposed operations include combined KV compression (compress_kv_3bit), key-only compression (compress_keys), value-only compression (compress_values), value decompression (decompress_values), and fused_attention incorporating online softmax and V accumulation.

Maintenance & Community
No specific community links (Discord/Slack) or contributor details are provided in the README.
Licensing & Compatibility
Limitations & Caveats
cuTile acceleration support varies by GPU and CUDA driver version; H100 support is noted as pending. When cuTile is unavailable or incompatible with the system's configuration, the library falls back to a PyTorch implementation.
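A common way to implement such a fallback is import probing. This is a generic sketch of the pattern, not the library's actual code; the module name `cuda_tile` and the probing logic are assumptions for illustration.

```python
def select_backend():
    """Pick the accelerated backend if present, else fall back to PyTorch.

    The import name "cuda_tile" is an assumption for this sketch, not a
    documented TurboQuant-GPU internal.
    """
    try:
        import cuda_tile  # noqa: F401  (hypothetical cuTile import name)
        return "cutile"
    except ImportError:
        return "pytorch"  # fallback path when cuTile is unavailable

print(select_backend())
```

Probing at import time keeps a single code path for callers: the same API runs everywhere, with acceleration applied only when the driver and GPU support it.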
Last updated 3 weeks ago; the repository is currently marked inactive.