KIVI by jy-yuan

Research paper implementation for KV cache quantization

Created 1 year ago
323 stars

Top 84.1% on SourcePulse

Project Summary

KIVI is a plug-and-play 2-bit quantization algorithm for Large Language Model (LLM) KV caches, designed to reduce memory usage and increase inference throughput without fine-tuning. It targets researchers and engineers working with LLMs who need to optimize performance for long contexts or larger batch sizes.

How It Works

KIVI employs an asymmetric quantization scheme: the key cache is quantized per channel and the value cache per token, both to 2 bits. The asymmetry follows the structure of the two caches: key activations concentrate outliers in a few channels, so per-channel scales bound their error, while value activations show no such pattern and quantize well per token. The scheme is hardware-friendly, aims to match the accuracy of a full-precision KV cache, and is designed to drop into existing LLM architectures.
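As a concrete illustration of the asymmetry, here is a minimal PyTorch sketch that applies min-max 2-bit quantization per channel to a toy key cache and per token to a toy value cache. It shows the idea only: the actual implementation additionally groups elements within each channel or token and keeps a short window of recent tokens in full precision, both omitted here, and the helper names below are illustrative rather than the repository's API.

    import torch

    def quantize_2bit(x: torch.Tensor, dim: int):
        # Asymmetric min-max quantization to 4 levels (2 bits) along `dim`.
        # Returns the integer codes plus the scale and zero point needed
        # to dequantize.
        qmax = 3
        xmin = x.amin(dim=dim, keepdim=True)
        xmax = x.amax(dim=dim, keepdim=True)
        scale = (xmax - xmin).clamp(min=1e-8) / qmax
        codes = ((x - xmin) / scale).round().clamp(0, qmax).to(torch.uint8)
        return codes, scale, xmin

    def dequantize(codes, scale, zero):
        return codes.float() * scale + zero

    # Toy single-head KV cache: (num_tokens, head_dim)
    K = torch.randn(128, 64)
    V = torch.randn(128, 64)

    # KIVI's asymmetry: key statistics are taken across tokens (one scale
    # per channel), value statistics across channels (one scale per token).
    k_codes, k_scale, k_zero = quantize_2bit(K, dim=0)  # per-channel keys
    v_codes, v_scale, v_zero = quantize_2bit(V, dim=1)  # per-token values

    K_hat = dequantize(k_codes, k_scale, k_zero)
    print("mean abs key error:", (K_hat - K).abs().mean().item())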

Quick Start & Requirements

  • Install the package in editable mode from the repository root: pip install -e .
  • Install the CUDA implementation: cd quant && pip install -e .
  • Requires Python 3.10+, PyTorch, and Hugging Face Transformers.
  • The CUDA implementation is compiled at install time, so a working CUDA toolchain is needed.
  • Official documentation and examples are available in the repository; a minimal usage sketch follows this list.
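
Once installed, usage follows the example scripts in the repository. The sketch below mirrors the pattern those examples use for a Llama model; the class name LlamaForCausalLM_KIVI and the config fields k_bits, v_bits, group_size, and residual_length come from the repository's examples and may change between versions, so verify them against the current code before copying.

    import torch
    from transformers import LlamaConfig, AutoTokenizer
    from models.llama_kivi import LlamaForCausalLM_KIVI  # shipped in this repo

    model_name = "meta-llama/Llama-2-7b-hf"
    config = LlamaConfig.from_pretrained(model_name)
    config.k_bits = 2            # 2-bit key cache
    config.v_bits = 2            # 2-bit value cache
    config.group_size = 32       # elements per quantization group
    config.residual_length = 32  # recent tokens kept in full precision

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = LlamaForCausalLM_KIVI.from_pretrained(
        model_name,
        config=config,
        torch_dtype=torch.float16,
        device_map="auto",
    )

    inputs = tokenizer("KIVI compresses the KV cache to", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))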

Highlighted Details

  • Achieves 2.6x peak memory reduction for KV caches.
  • Increases LLM inference throughput by 2.35x to 3.47x.
  • Supports Llama, Falcon, and Mistral model families.
  • Recent updates include support for GQA and for Hugging Face Transformers' built-in KV cache quantization (see the sketch after this list).
  • Beta optimizations for reduced latency are available in the develop branch.
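
For the Transformers-side integration noted above, recent Hugging Face Transformers releases also expose a KIVI-style quantized KV cache directly through generate, with no repository-specific patches. A minimal sketch, assuming a recent Transformers version with the quanto backend (optimum-quanto) installed:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )

    inputs = tokenizer("Long-context inference needs", return_tensors="pt").to(model.device)

    # Quantize the KV cache on the fly instead of keeping it in fp16.
    out = model.generate(
        **inputs,
        max_new_tokens=32,
        cache_implementation="quantized",
        cache_config={"backend": "quanto", "nbits": 2},
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))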

Maintenance & Community

  • The accompanying paper was accepted to ICML 2024.
  • Contributions are welcomed via issues or pull requests.
  • No specific community channels (Discord/Slack) are listed.

Licensing & Compatibility

  • Released under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • The CUDA implementation requires separate installation and compilation.
  • Some optimizations are in beta and may require reinstallation.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

  • 9 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Jeremy Howard (cofounder of fast.ai).

QuaRot by spcl

  • Top 0.5% on SourcePulse
  • 424 stars
  • Code for a NeurIPS 2024 research paper on LLM quantization
  • Created 1 year ago; updated 9 months ago