huawei-csl: Boost LLM context and throughput with KV-cache quantization
KVarN is a native KV-cache quantization backend for vLLM, targeting agentic and long-context workloads. KV-cache quantization usually trades accuracy or speed for capacity; KVarN reports 3-5x more KV-cache capacity and ~1.3x higher throughput than FP16 while preserving FP16-level accuracy. This allows significantly longer contexts and higher serving concurrency without quality degradation.
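As a rough sanity check on the capacity figure, the arithmetic for the kvarn_k4v2_g128 preset (4-bit keys, 2-bit values) can be sketched as follows. This is a back-of-the-envelope estimate, not KVarN's actual memory layout: it assumes that g128 denotes a quantization group of 128 elements and that each group stores 32 bits of metadata (for example an FP16 scale plus an FP16 zero point). Both are assumptions, and the function name is hypothetical.

```python
def kv_compression_ratio(key_bits: int, value_bits: int, group_size: int,
                         meta_bits_per_group: int = 32,
                         baseline_bits: int = 16) -> float:
    """Estimate how many times smaller a quantized KV cache is than FP16.

    Assumption (not from the KVarN source): every group of `group_size`
    elements carries `meta_bits_per_group` bits of scale/zero-point metadata.
    """
    # Metadata cost amortized over the elements of one group.
    overhead_bits = meta_bits_per_group / group_size
    # Keys and values hold the same number of elements, so average their widths.
    avg_bits = (key_bits + value_bits) / 2 + overhead_bits
    return baseline_bits / avg_bits


# 4-bit keys, 2-bit values, groups of 128 elements.
print(round(kv_compression_ratio(4, 2, 128), 2))  # → 4.92
```

Under these assumptions the estimate sits near the upper end of the reported 3-5x range; real capacity also depends on block allocation and memory profiling.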
How It Works
KVarN quantizes KV-cache tiles in a four-stage pipeline. It starts from the FP16 cache, applies an orthonormal Hadamard rotation to spread outliers across channels, runs an iterative, Sinkhorn-like variance normalization to equalize variance and reduce quantization error, and finally applies asymmetric round-to-nearest quantization with scale folding. The kvarn_k4v2_g128 preset (4-bit keys, 2-bit values) is designed to match FP16 accuracy and exceed FP16 throughput, which sets it apart from methods like TurboQuant that sacrifice performance for capacity.
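The four stages can be illustrated with a small NumPy sketch. This is a simplified reference under stated assumptions, not KVarN's Triton implementation: the alternating row/column RMS rescaling stands in for the "Sinkhorn-like" normalization, quantization is per row rather than per 128-element group, and "scale folding" is interpreted as merging the row normalization factors into the quantization scale and zero point. All function names are hypothetical.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix (Sylvester construction); n is a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def normalize_variance(x: np.ndarray, iters: int = 4, eps: float = 1e-8):
    """Alternately rescale rows and columns to unit RMS (assumed Sinkhorn-like step).

    Returns (xn, row, col) with x == row[:, None] * xn * col[None, :].
    """
    xn = x.astype(np.float64)
    row = np.ones(x.shape[0])
    col = np.ones(x.shape[1])
    for _ in range(iters):
        r = np.sqrt((xn ** 2).mean(axis=1)) + eps
        xn /= r[:, None]
        row *= r
        c = np.sqrt((xn ** 2).mean(axis=0)) + eps
        xn /= c[None, :]
        col *= c
    return xn, row, col


def quantize_asym_rtn(x: np.ndarray, bits: int):
    """Per-row asymmetric round-to-nearest quantization to `bits` bits."""
    qmax = 2 ** bits - 1
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    q = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo


def roundtrip(tile: np.ndarray, bits: int) -> np.ndarray:
    """Quantize a (tokens, head_dim) tile and reconstruct it."""
    H = hadamard(tile.shape[1])
    rotated = tile.astype(np.float64) @ H           # stage 2: spread outliers
    xn, row, col = normalize_variance(rotated)      # stage 3: equalize variance
    q, scale, lo = quantize_asym_rtn(xn, bits)      # stage 4: asymmetric RTN
    # Assumed "scale folding": absorb row factors into scale and zero point.
    folded_scale = scale * row[:, None]
    folded_zero = lo * row[:, None]
    dequant = (q * folded_scale + folded_zero) * col[None, :]
    return dequant @ H.T                            # undo the rotation


rng = np.random.default_rng(0)
tile = rng.normal(size=(128, 128))
tile[:, 7] *= 30.0  # synthetic outlier channel
for bits in (4, 2):
    err = np.linalg.norm(tile - roundtrip(tile, bits)) / np.linalg.norm(tile)
    print(f"{bits}-bit relative error: {err:.4f}")
```

Because the Hadamard matrix is orthonormal, multiplying by its transpose restores the original basis exactly, so all reconstruction error comes from the rounding step.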
Quick Start & Requirements
Install KVarN by cloning the repository and running the following from its root (the trailing dot is the install path): VLLM_USE_PRECOMPILED=1 pip install -e .
It requires vLLM (v0.22.0); its Triton kernels are JIT-compiled at runtime. Enable KVarN by setting kv_cache_dtype="kvarn_k4v2_g128" and block_size=128 during LLM initialization or serving, with float16 compute. Reaching full KV-cache capacity may require adjusting memory profiler settings (e.g., VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0).
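The settings above can be collected in one place, as in the minimal sketch below. The helper names are hypothetical, and the model identifier is a placeholder to be supplied by the user. The keyword arguments and CLI flags (kv_cache_dtype / --kv-cache-dtype, block_size / --block-size, dtype / --dtype) are standard vLLM options, but the kvarn_k4v2_g128 value is only expected to be accepted by a KVarN-enabled vLLM build.

```python
KVARN_PRESET = "kvarn_k4v2_g128"  # 4-bit keys, 2-bit values


def kvarn_llm_kwargs(model: str) -> dict:
    """Keyword arguments for vllm.LLM(...) with KVarN enabled."""
    return {
        "model": model,
        "dtype": "float16",              # KVarN expects float16 compute
        "kv_cache_dtype": KVARN_PRESET,
        "block_size": 128,               # fixed in the current implementation
    }


def kvarn_serve_command(model: str) -> list[str]:
    """Equivalent `vllm serve` command line, as an argv list."""
    return [
        "vllm", "serve", model,
        "--dtype", "float16",
        "--kv-cache-dtype", KVARN_PRESET,
        "--block-size", "128",
    ]


def kvarn_env() -> dict:
    """Optional environment override mentioned for reaching full capacity."""
    return {"VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS": "0"}


# Usage on a machine with the KVarN build installed (not executed here):
#   from vllm import LLM
#   llm = LLM(**kvarn_llm_kwargs("<your-model>"))
print(" ".join(kvarn_serve_command("<your-model>")))
```

Keeping the preset name in a single constant avoids the Python and CLI configurations drifting apart.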
Licensing & Compatibility
KVarN is released under the permissive Apache 2.0 License, which allows commercial use and integration. It is built on vLLM v0.22.0.
Limitations & Caveats
The current implementation supports only a fixed block_size of 128. Triton kernels are JIT-compiled at runtime rather than shipped prebuilt. Reaching maximum KV-cache capacity may require adjusting memory profiler settings, as noted under Quick Start & Requirements.