turboquant-gpu by DevTechJr

LLM inference acceleration via KV cache compression

Created 3 weeks ago


254 stars

Top 99.1% on SourcePulse

Project Summary

TurboQuant-GPU is a library that compresses the Key-Value (KV) cache on NVIDIA GPUs to make LLM inference more efficient. It targets researchers and engineers who want to shrink memory footprints and accelerate inference, offering substantial compression ratios through advanced quantization techniques.

How It Works

A random orthogonal rotation transforms the KV cache coordinates into an approximately Gaussian distribution, which enables near-optimal Lloyd-Max quantization: high compression with minimal similarity loss (0.98 cosine similarity). Keys are quantized to 2 bits with an MSE-optimal quantizer plus a 1-bit QJL bias correction; values receive 3-bit MSE quantization. Both key and value compression run in a single fused kernel launch per attention head. By exploiting the post-rotation Gaussian structure of KV caches, this approach achieves higher compression (5.02x) than general-purpose FP4 formats such as MXFP4 and NVFP4.
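The rotate-then-quantize idea can be sketched in plain NumPy. This is a toy illustration, not the library's implementation: the rotation construction, the scalar quantizer, and the synthetic data are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Haar-random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def lloyd_max_1d(x, bits, iters=30):
    # Fit a scalar Lloyd-Max (1-D k-means) quantizer to samples x.
    k = 2 ** bits
    levels = np.quantile(x, (np.arange(k) + 0.5) / k)  # quantile initialization
    for _ in range(iters):
        bounds = (levels[:-1] + levels[1:]) / 2        # nearest-level boundaries
        idx = np.searchsorted(bounds, x)
        for j in range(k):                             # centroid update per bin
            sel = x[idx == j]
            if sel.size:
                levels[j] = sel.mean()
    return levels

def quantize(x, levels):
    bounds = (levels[:-1] + levels[1:]) / 2
    return levels[np.searchsorted(bounds, x)]

# Synthetic anisotropic "cache" vectors: per-coordinate scales mimic
# non-Gaussian structure before rotation.
d = 128
v = rng.standard_normal((1000, d)) * np.linspace(0.5, 2.0, d)
rotated = v @ random_rotation(d)  # coordinates become approximately i.i.d. Gaussian

flat = rotated.ravel()
levels = lloyd_max_1d(flat, bits=3)  # 3-bit quantizer, as used for values
q = quantize(flat, levels)
cos = flat @ q / (np.linalg.norm(flat) * np.linalg.norm(q))
print(f"3-bit cosine similarity: {cos:.3f}")  # typically close to 0.98
```

The rotation is the key step: scalar Lloyd-Max is only near-optimal when every coordinate is drawn from the same (Gaussian) distribution, which the rotation approximately enforces regardless of the cache's original per-dimension statistics.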

Quick Start & Requirements

  • Primary install: pip install turboquant-gpu
  • Optional cuTile GPU kernel acceleration: pip install cuda-tile[tileiras] --extra-index-url https://pypi.nvidia.com (requires a CUDA 13.0+ driver). If cuTile is unavailable or the driver is older, the library falls back to PyTorch kernels.
  • Prerequisites: NVIDIA GPU. CUDA toolkit recommended for cuTile acceleration.
  • Links: quickstart.ipynb notebook available for installation and usage guidance.

Highlighted Details

  • Compression Ratio: Achieves a claimed 5.02x KV cache compression, significantly outperforming NVIDIA's FP4 (3.76x for MXFP4, 3.56x for NVFP4) by leveraging KV cache-specific properties.
  • Kernel Portability: Utilizes cuTile kernels for cross-architecture acceleration on NVIDIA GPUs, with an automatic fallback to PyTorch kernels if cuTile is unavailable or incompatible.
  • Fused Kernels: Implements optimized kernels for fused K+V compression (compress_kv_3bit), key-only compression (compress_keys), value-only compression (compress_values), value decompression (decompress_values), and fused_attention incorporating online softmax and V accumulation.
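The quoted ratios are consistent with simple bits-per-element arithmetic against an FP16 baseline. The block sizes below follow the standard MXFP4/NVFP4 conventions; the TurboQuant overhead figure is inferred from the README's numbers, not documented.

```python
BASELINE = 16                        # FP16 bits per element

# MXFP4: 4-bit elements sharing one 8-bit scale per 32-element block.
mxfp4_bits = 4 + 8 / 32              # 4.25 bits/element
# NVFP4: 4-bit elements sharing one 8-bit scale per 16-element block.
nvfp4_bits = 4 + 8 / 16              # 4.5 bits/element

print(round(BASELINE / mxfp4_bits, 2))  # 3.76, matching the README
print(round(BASELINE / nvfp4_bits, 2))  # 3.56, matching the README

# TurboQuant: keys use 2 + 1 = 3 bits, values 3 bits, so ~3 bits/element
# of payload. A 5.02x ratio implies 16 / 5.02 ≈ 3.19 bits/element, i.e.
# roughly 0.19 bits/element of scale/bias metadata (an inference, not a
# documented figure).
print(round(BASELINE / 5.02, 2))
```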

Maintenance & Community

No specific community links (Discord/Slack) or contributor details are provided in the README.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: This permissive license generally allows for commercial use and linking within closed-source projects.

Limitations & Caveats

cuTile acceleration varies by GPU and CUDA driver version; H100 support is noted as pending. When cuTile is unavailable or incompatible with the system configuration, the library falls back to PyTorch kernels.
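A minimal sketch of the import-time fallback pattern such a library might use. The module name cuda_tile and the flag/dispatch function are assumptions for illustration, not turboquant-gpu's actual internals.

```python
# Try the accelerated backend first; degrade gracefully to PyTorch.
try:
    import cuda_tile                 # hypothetical cuTile binding, assumed name
    HAS_CUTILE = True
except ImportError:
    HAS_CUTILE = False

def compress_backend():
    # Dispatch to the fast path only when the optional dependency loaded.
    return "cutile" if HAS_CUTILE else "pytorch"

print(compress_backend())
```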

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 254 stars in the last 22 days
