KVarN by huawei-csl

Boost LLM context and throughput with KV-cache quantization

Created 1 month ago

440 stars

Top 67.2% on SourcePulse


2 Experts Love This Project

Jeremy Howard

Cofounder of fast.ai

Luis Capelo

Cofounder of Lightning AI

Project Summary

KVarN is a native KV-cache quantization backend for vLLM, targeting agentic and long-context workloads. It avoids the usual trade-offs of KV-cache quantization: it offers 3-5x more KV-cache capacity and up to ~1.3x higher throughput than FP16 while preserving FP16-level accuracy. This allows significantly longer contexts and higher serving concurrency without quality degradation.

How It Works

KVarN quantizes each KV-cache tile in four stages. It starts from the FP16 cache, applies an orthonormal Hadamard rotation to spread outliers, runs an iterative (Sinkhorn-like) variance normalization to equalize variance and reduce quantization error, and finally applies asymmetric round-to-nearest quantization with scale folding. The kvarn_k4v2_g128 preset (4-bit keys, 2-bit values) is tuned to match FP16 accuracy while exceeding FP16 throughput, which sets it apart from methods like TurboQuant that sacrifice performance for capacity.
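The four stages can be illustrated with a small NumPy sketch. This is not KVarN's Triton implementation. The per-row scale and zero-point, the alternating row/column RMS normalization, and the way the row factor is folded into the scale are all assumptions chosen to mirror the description above, and every function name here is hypothetical.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix (Sylvester construction, n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def variance_normalize(x: np.ndarray, iters: int = 4, eps: float = 1e-6):
    """Sinkhorn-like step (assumed form): alternately rescale rows and
    columns toward unit RMS, tracking factors so that x == y * row * col."""
    y = x.astype(np.float64).copy()
    row = np.ones((x.shape[0], 1))
    col = np.ones((1, x.shape[1]))
    for _ in range(iters):
        r = np.sqrt(np.mean(y**2, axis=1, keepdims=True)) + eps
        y /= r
        row *= r
        c = np.sqrt(np.mean(y**2, axis=0, keepdims=True)) + eps
        y /= c
        col *= c
    return y, row, col

def quantize_rtn(y: np.ndarray, bits: int):
    """Asymmetric round-to-nearest with one scale and zero-point per row."""
    levels = 2**bits - 1
    lo = y.min(axis=1, keepdims=True)
    hi = y.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.round((y - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo

def roundtrip(x: np.ndarray, bits: int):
    """Rotate, normalize, quantize, then dequantize back to the input space."""
    H = hadamard(x.shape[1])
    y, row, col = variance_normalize(x @ H)      # stages 2 and 3
    q, scale, zero = quantize_rtn(y, bits)       # stage 4
    # Scale folding (assumed form): absorb the row factor into scale/zero,
    # so dequantization needs only one extra per-column multiply.
    scale_f, zero_f = scale * row, zero * row
    x_hat = ((q * scale_f + zero_f) * col) @ H.T  # H is orthonormal
    return x_hat, q
```

Calling roundtrip(x, 4) and roundtrip(x, 2) on a tile with an injected outlier and comparing the relative reconstruction errors gives a quick feel for why keys and values can tolerate different bit widths; the 4-bit error should come out smaller than the 2-bit one.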

Quick Start & Requirements

Install KVarN by cloning the repository and, from inside it, running VLLM_USE_PRECOMPILED=1 pip install -e . (the trailing dot is part of the command). KVarN is built on vLLM v0.22.0 and uses Triton kernels that are JIT-compiled at runtime. Enable it by setting kv_cache_dtype="kvarn_k4v2_g128" and block_size=128 during LLM initialization or serving, with float16 as the compute dtype. Reaching full KV-cache capacity may require adjusting memory profiler settings (e.g., VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0).
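As a rough sketch, the settings above map onto vLLM's offline LLM constructor as follows. The helper name kvarn_engine_kwargs and the model identifier are hypothetical placeholders, and the snippet assumes the KVarN build of vLLM is installed so that the kvarn_k4v2_g128 cache dtype is recognized.

```python
import os

def kvarn_engine_kwargs(model: str) -> dict:
    """Collect the engine arguments named in this summary (hypothetical helper)."""
    return {
        "model": model,
        "kv_cache_dtype": "kvarn_k4v2_g128",  # 4-bit keys, 2-bit values
        "block_size": 128,                     # fixed in the current implementation
        "dtype": "float16",                    # float16 compute
    }

def main() -> None:
    # Optional memory-profiler adjustment mentioned in the summary;
    # it is set before the engine is created.
    os.environ.setdefault("VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS", "0")

    # Imported lazily: this needs the KVarN build of vLLM and a GPU.
    from vllm import LLM, SamplingParams

    # "your-org/your-model" is a placeholder, not a model from the source.
    llm = LLM(**kvarn_engine_kwargs("your-org/your-model"))
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)
```

Calling main() runs a single short generation. Keeping the arguments in one dictionary makes it easy to compare a KVarN run against an FP16 baseline by dropping the kv_cache_dtype entry.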

Highlighted Details

Provides 3-5x greater KV-cache capacity.
Achieves up to ~1.3x FP16 throughput.
Maintains FP16-level accuracy.
Outperforms TurboQuant in throughput (~2.4x) and accuracy.
Offers calibration-free, plug-and-play integration with vLLM.
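
The capacity figure can be sanity-checked with simple arithmetic. The sketch below assumes that g128 in the preset name denotes a quantization group of 128 elements and that each group stores an FP16 scale and an FP16 zero-point (32 bits of metadata); neither detail is confirmed by this summary, and the function names are hypothetical.

```python
def kv_bits_per_element(key_bits: int, value_bits: int,
                        group_size: int, meta_bits_per_group: int) -> float:
    """Average stored bits per cache element, averaged over keys and values."""
    payload = (key_bits + value_bits) / 2        # keys and values are equal in count
    overhead = meta_bits_per_group / group_size  # metadata amortized over the group
    return payload + overhead

def capacity_gain(bits_per_element: float, baseline_bits: int = 16) -> float:
    """How many times more elements fit in the same memory versus FP16."""
    return baseline_bits / bits_per_element

# Assumed layout: 4-bit keys, 2-bit values, groups of 128, 32 metadata bits.
bits = kv_bits_per_element(4, 2, group_size=128, meta_bits_per_group=32)
gain = capacity_gain(bits)
ideal = capacity_gain(kv_bits_per_element(4, 2, 128, 0))
```

Under these assumptions each element costs 3.25 bits, for a gain of about 4.9x, against an upper bound of about 5.3x with no metadata at all. That is consistent with the upper end of the 3-5x range; real gains also depend on how much GPU memory the KV cache is actually allotted.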

Licensing & Compatibility

KVarN is released under the permissive Apache 2.0 License, which allows commercial use and integration, and is built on vLLM v0.22.0.

Limitations & Caveats

The current implementation supports only a fixed block_size of 128. Triton kernels are JIT-compiled at runtime, which typically adds a one-time compilation delay on first use. Reaching maximum KV-cache capacity may require specific memory configuration adjustments.

Health Check

Last Commit

1 month ago

Responsiveness

Inactive

Star History

20 stars in the last 30 days