KIVI by jy-yuan

Research paper implementation for KV cache quantization

created 1 year ago
312 stars

Top 87.5% on sourcepulse

Project Summary

KIVI is a plug-and-play 2-bit quantization algorithm for Large Language Model (LLM) KV caches, designed to reduce memory usage and increase inference throughput without fine-tuning. It targets researchers and engineers working with LLMs who need to optimize performance for long contexts or larger batch sizes.

How It Works

KIVI employs an asymmetric quantization scheme, quantizing the key cache per-channel and the value cache per-token to 2 bits. This approach is hardware-friendly and aims to maintain comparable quality to full-precision KV caches. The method is designed to be integrated seamlessly into existing LLM architectures.
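
To make the grouping directions concrete, here is a minimal, self-contained sketch of asymmetric 2-bit quantization applied per-channel to keys and per-token to values. It is an illustration, not the repository's implementation: the function names and the [batch, heads, seq_len, head_dim] cache layout are assumptions, and the full method's details (group-wise quantization, a full-precision residual for the most recent tokens) are omitted.

    import torch

    def quantize_2bit(x: torch.Tensor, dim: int):
        """Asymmetric 2-bit quantization along `dim`: store integer codes
        in [0, 3] plus the scale and zero point (the group minimum)."""
        xmin = x.amin(dim=dim, keepdim=True)
        xmax = x.amax(dim=dim, keepdim=True)
        scale = (xmax - xmin).clamp(min=1e-8) / 3.0  # 2 bits -> 4 levels
        codes = ((x - xmin) / scale).round().clamp(0, 3).to(torch.uint8)
        return codes, scale, xmin

    def dequantize(codes, scale, xmin):
        return codes.to(scale.dtype) * scale + xmin

    # Toy KV cache shaped [batch, heads, seq_len, head_dim] (assumed layout)
    K = torch.randn(1, 8, 128, 64)
    V = torch.randn(1, 8, 128, 64)

    # Key cache: per-channel -- quantization constants shared across tokens,
    # so reduce along the sequence (token) dimension.
    k_codes, k_scale, k_min = quantize_2bit(K, dim=-2)

    # Value cache: per-token -- constants shared across channels,
    # so reduce along the head (channel) dimension.
    v_codes, v_scale, v_min = quantize_2bit(V, dim=-1)

    print((K - dequantize(k_codes, k_scale, k_min)).abs().mean())
    print((V - dequantize(v_codes, v_scale, v_min)).abs().mean())

Quantizing keys per-channel matters because key activations tend to have outlier channels; sharing constants across tokens within a channel isolates those outliers, while per-token value quantization keeps the attention-output error small.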

Quick Start & Requirements

  • Install via pip: pip install -e .
  • Install the CUDA implementation: cd quant && pip install -e .
  • Requires Python 3.10+, PyTorch, and Hugging Face Transformers.
  • CUDA implementation requires compilation.
  • Official documentation and examples are available in the repository; a hedged usage sketch follows this list.
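
The sketch below shows how a KIVI-patched Llama model might be configured and run. The class, module path, and config attributes (LlamaForCausalLM_KIVI, models.llama_kivi, k_bits, v_bits, group_size, residual_length) are assumptions modeled on the repository's example pattern and should be verified against its documentation.

    # Hypothetical usage sketch; class/attribute names below are assumptions
    # -- check the repository's examples for the exact API.
    from transformers import AutoTokenizer, LlamaConfig
    from models.llama_kivi import LlamaForCausalLM_KIVI  # assumed module path

    model_id = "meta-llama/Llama-2-7b-hf"
    config = LlamaConfig.from_pretrained(model_id)
    config.k_bits = 2            # 2-bit key cache
    config.v_bits = 2            # 2-bit value cache
    config.group_size = 32       # quantization group size
    config.residual_length = 32  # recent tokens kept in full precision

    model = LlamaForCausalLM_KIVI.from_pretrained(model_id, config=config)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer("KIVI keeps the KV cache in 2 bits.", return_tensors="pt")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))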

Highlighted Details

  • Achieves 2.6x peak memory reduction for KV caches.
  • Increases LLM inference throughput by 2.35x to 3.47x.
  • Supports Llama, Falcon, and Mistral model families.
  • Recent updates include support for grouped-query attention (GQA) and integration with Hugging Face Transformers KV cache quantization.
  • Beta optimizations for reduced latency are available in the develop branch.

Maintenance & Community

  • The project was accepted to ICML 2024.
  • Contributions are welcomed via issues or pull requests.
  • No specific community channels (Discord/Slack) are listed.

Licensing & Compatibility

  • Released under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • The CUDA implementation requires separate installation and compilation.
  • Some optimizations are in beta and may require reinstallation.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 20 stars in the last 90 days

Explore Similar Projects

  • nunchaku by nunchaku-tech: high-performance 4-bit diffusion model inference engine (3k stars; created 8 months ago, updated 14 hours ago). Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.
  • llm-awq by mit-han-lab: weight quantization research paper for LLM compression/acceleration (3k stars; created 2 years ago, updated 2 weeks ago). Starred by Chip Huyen, Jeremy Howard (cofounder of fast.ai), and 4 more.