Research paper implementation for KV cache quantization
Top 87.5% on sourcepulse
KIVI is a plug-and-play 2-bit quantization algorithm for Large Language Model (LLM) KV caches, designed to reduce memory usage and increase inference throughput without fine-tuning. It targets researchers and engineers working with LLMs who need to optimize performance for long contexts or larger batch sizes.
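To make the memory claim concrete, here is a back-of-the-envelope calculation (a sketch using Llama-2-7B-like dimensions, which are assumptions; scale/zero-point overhead is ignored):

# Rough KV cache footprint per token for a Llama-2-7B-like model
# (32 layers, hidden size 4096); dimensions are illustrative assumptions.
layers, hidden = 32, 4096
fp16_bytes = 2 * layers * hidden * 2      # K and V, 2 bytes per element
kivi_bytes = 2 * layers * hidden * 2 / 8  # 2-bit codes are 1/8 of 16 bits
print(f"{fp16_bytes / 1024:.0f} KiB vs {kivi_bytes / 1024:.0f} KiB per token")
# ~512 KiB vs ~64 KiB, before scale/zero-point overhead and the small
# full-precision residual of recent tokens that KIVI retains.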
How It Works
KIVI employs an asymmetric quantization scheme: the key cache is quantized per-channel and the value cache per-token, both to 2 bits. The split reflects the caches' statistics: outliers in key activations concentrate in a few channels, so giving each channel its own quantization range preserves them, while value activations show no such pattern and tolerate per-token grouping. The scheme is hardware-friendly, aims to keep quality comparable to a full-precision KV cache, and integrates into existing LLM architectures without fine-tuning.
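A minimal PyTorch sketch of the idea (toy code for intuition, not KIVI's fused kernels; group sizes, bit packing, and the full-precision residual window are omitted):

import torch

def quantize_2bit(x, dim):
    # Asymmetric quantization: map each group's [min, max] onto 4 levels (0..3).
    mn = x.amin(dim=dim, keepdim=True)
    mx = x.amax(dim=dim, keepdim=True)
    scale = (mx - mn).clamp(min=1e-8) / 3.0
    codes = ((x - mn) / scale).round().clamp(0, 3).to(torch.uint8)
    return codes, scale, mn

def dequantize(codes, scale, zero):
    return codes.to(scale.dtype) * scale + zero

# KV cache laid out as (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# Keys: per-channel groups (statistics taken across tokens, dim=2),
# which isolates the outlier channels typical of key activations.
k_q = quantize_2bit(k, dim=2)
# Values: per-token groups (statistics taken across channels, dim=3).
v_q = quantize_2bit(v, dim=3)

print((dequantize(*k_q) - k).abs().mean())  # mean reconstruction error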
Quick Start & Requirements
# Install KIVI in editable mode from the repository root
pip install -e .
# Install the separate package in the quant/ directory (quantization kernels)
cd quant && pip install -e .
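After installation, usage follows the pattern below. This is a hedged sketch: the module path models.llama_kivi, the class LlamaForCausalLM_KIVI, and the config fields k_bits, v_bits, group_size, and residual_length are assumptions drawn from the repo's example pattern; check the repo README for the exact names.

import torch
from transformers import LlamaConfig, AutoTokenizer
# Assumed import path; the KIVI repo ships patched model classes.
from models.llama_kivi import LlamaForCausalLM_KIVI

name = "meta-llama/Llama-2-7b-hf"
config = LlamaConfig.from_pretrained(name)
config.k_bits = 2            # assumed field: key cache bit width
config.v_bits = 2            # assumed field: value cache bit width
config.group_size = 32       # assumed field: elements per quantization group
config.residual_length = 32  # assumed field: recent tokens kept in full precision

model = LlamaForCausalLM_KIVI.from_pretrained(
    name, config=config, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(name)
inputs = tokenizer("KIVI compresses the KV cache to", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))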
Highlighted Details
Ongoing development happens on the develop branch.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats