KVarN  by huawei-csl

Boost LLM context and throughput with KV-cache quantization

Created 1 week ago

351 stars

Top 79.2% on SourcePulse

Project Summary

KVarN is a native KV-cache quantization backend for vLLM, aimed at agentic and long-context workloads. KV-cache quantization usually trades accuracy or speed for memory; KVarN instead reports 3-5x more cache capacity and up to ~1.3x the throughput of an FP16 cache while preserving FP16-level accuracy. The extra capacity allows longer contexts and higher serving concurrency without quality degradation.

How It Works

KVarN quantizes KV-cache tiles in a four-stage pipeline. Starting from the FP16 cache, it applies an orthonormal Hadamard rotation to spread outliers across the tile, then an iterative, Sinkhorn-like variance normalization that equalizes variance and reduces quantization error, and finally asymmetric round-to-nearest quantization with scale folding. The kvarn_k4v2_g128 preset (4-bit keys, 2-bit values) is tuned to match FP16 accuracy and exceed FP16 throughput, which sets it apart from methods such as TurboQuant that give up performance in exchange for capacity.
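The stages above can be illustrated with a small NumPy sketch. This is not KVarN's Triton implementation: the function names, the per-row quantization granularity, the iteration count, and the reading of "scale folding" as merging the row normalization scale into the quantization scale and zero point are all assumptions made for illustration.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def variance_normalize(x, iters=4, eps=1e-8):
    """Sinkhorn-like step (assumed form): alternately rescale rows and columns to unit RMS.

    Returns the normalized tile plus the accumulated row and column scales,
    so that x == y * row * col up to floating-point error.
    """
    y = x.astype(np.float64)
    row = np.ones((y.shape[0], 1))
    col = np.ones((1, y.shape[1]))
    for _ in range(iters):
        r = np.sqrt(np.mean(y * y, axis=1, keepdims=True)) + eps
        y, row = y / r, row * r
        c = np.sqrt(np.mean(y * y, axis=0, keepdims=True)) + eps
        y, col = y / c, col * c
    return y, row, col


def quantize_rtn(y, bits):
    """Asymmetric round-to-nearest, one scale and zero point per row (assumed granularity)."""
    levels = 2 ** bits - 1
    lo = y.min(axis=1, keepdims=True)
    hi = y.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.rint((y - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo


def roundtrip(x, bits):
    """Rotate, normalize, quantize, then reconstruct the tile."""
    h = hadamard(x.shape[1])
    y, row, col = variance_normalize(x @ h)
    q, scale, lo = quantize_rtn(y, bits)
    # "Scale folding" (assumed meaning): merge the row normalization scale
    # into the quantization scale and zero point so dequantization needs
    # only one multiply-add per element plus the column scale.
    folded_scale, folded_zero = scale * row, lo * row
    x_rot_hat = (q * folded_scale + folded_zero) * col
    return q, x_rot_hat @ h.T  # undo the orthonormal rotation


def relative_error(x, bits):
    _, x_hat = roundtrip(x, bits)
    return np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

Calling relative_error on a random tile with one inflated column (a stand-in for an outlier channel) at 4 bits and at 2 bits shows how the reconstruction error grows as the bit width shrinks.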

Quick Start & Requirements

Install KVarN by cloning the repository and running the following command from the repository root:

    VLLM_USE_PRECOMPILED=1 pip install -e .

KVarN requires vLLM (v0.22.0) and uses Triton kernels that are JIT-compiled at runtime. To enable it, set kv_cache_dtype="kvarn_k4v2_g128" and block_size=128 when initializing the LLM or launching the server, and use float16 compute. Reaching full KV-cache capacity may require adjusting memory profiler settings (e.g., VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0).
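As a rough sketch, the settings above could be passed to vLLM's offline LLM class as shown below. The kv_cache_dtype value and block size come from the project description. Mapping "float16 compute" to vLLM's dtype argument, the helper name kvarn_engine_kwargs, and the model name "your-org/your-model" are assumptions or placeholders. Running main() requires a GPU host with the KVarN build of vLLM installed.

```python
import os

# Profiler setting mentioned in the project description. Set here, before
# vLLM is imported, to be safe; setdefault keeps any value already exported.
os.environ.setdefault("VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS", "0")


def kvarn_engine_kwargs(model: str) -> dict:
    """Engine arguments for the kvarn_k4v2_g128 preset (hypothetical helper)."""
    return {
        "model": model,
        "dtype": "float16",                   # assumed mapping of "float16 compute"
        "kv_cache_dtype": "kvarn_k4v2_g128",  # 4-bit keys, 2-bit values
        "block_size": 128,                    # the only block size currently supported
    }


def main() -> None:
    # Imported lazily so the helper above can be inspected without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(**kvarn_engine_kwargs("your-org/your-model"))  # placeholder model
    params = SamplingParams(max_tokens=64)
    outputs = llm.generate(["Explain KV-cache quantization in one sentence."], params)
    print(outputs[0].outputs[0].text)
```

Keeping the arguments in one helper makes it easy to compare a KVarN run against an FP16 baseline by changing only kv_cache_dtype.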

Highlighted Details

  • Provides 3-5x greater KV-cache capacity.
  • Achieves up to ~1.3x FP16 throughput.
  • Maintains FP16-level accuracy.
  • Outperforms TurboQuant in throughput (~2.4x) and accuracy.
  • Offers calibration-free, plug-and-play integration with vLLM.
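
The capacity figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes that "g128" denotes a quantization group of 128 elements and that each group stores an FP16 scale and an FP16 zero point (32 bits of metadata). Neither detail is confirmed by the project description, and real capacity also depends on memory layout and allocator behavior.

```python
def effective_bits(key_bits: int, value_bits: int,
                   group_size: int, meta_bits_per_group: int) -> float:
    """Average stored bits per KV element, including per-group metadata."""
    payload = (key_bits + value_bits) / 2  # keys and values have equal element counts
    return payload + meta_bits_per_group / group_size


def capacity_gain(key_bits: int, value_bits: int, group_size: int,
                  meta_bits_per_group: int, baseline_bits: int = 16) -> float:
    """How many times more tokens fit in the same memory versus the baseline."""
    return baseline_bits / effective_bits(
        key_bits, value_bits, group_size, meta_bits_per_group
    )


# k4v2 with assumed groups of 128 and 32 bits of metadata per group:
# 3 payload bits + 0.25 metadata bits = 3.25 bits per element.
gain = capacity_gain(key_bits=4, value_bits=2, group_size=128, meta_bits_per_group=32)
```

Under these assumptions the gain is 16 / 3.25, or about 4.9x, which sits near the upper end of the 3-5x range quoted above.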

Licensing & Compatibility

KVarN is released under the permissive Apache 2.0 License and is built on vLLM v0.22.0, supporting commercial use and integration.

Limitations & Caveats

The current implementation features a fixed block_size of 128. Triton kernels are JIT-compiled at runtime. Achieving maximum KV-cache capacity might require specific memory configuration adjustments.

Health Check

Last Commit: 4 days ago
Responsiveness: Inactive
Pull Requests (30d): 9
Issues (30d): 3
Star History: 352 stars in the last 10 days
