KIVI by jy-yuan

Research paper implementation for KV cache quantization

Created 1 year ago
323 stars

Top 84.1% on SourcePulse

Project Summary

KIVI is a plug-and-play 2-bit quantization algorithm for Large Language Model (LLM) KV caches, designed to reduce memory usage and increase inference throughput without fine-tuning. It targets researchers and engineers working with LLMs who need to optimize performance for long contexts or larger batch sizes.

How It Works

KIVI employs an asymmetric quantization scheme: the key cache is quantized per channel and the value cache per token, both to 2 bits. The asymmetry follows the structure of the two caches: key activations concentrate outliers in a few channels, so per-channel scales bound their error, while value activations show no such pattern and quantize well per token. The scheme is hardware-friendly, aims to match the accuracy of a full-precision KV cache, and is designed to drop into existing LLM architectures.
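As a concrete illustration of the asymmetry, here is a minimal PyTorch sketch that applies min-max 2-bit quantization per channel to a toy key cache and per token to a toy value cache. It shows the idea only: the actual implementation additionally groups elements within each channel or token and keeps a short window of recent tokens in full precision, both omitted here, and the helper names below are illustrative rather than the repository's API.

    import torch

    def quantize_2bit(x: torch.Tensor, dim: int):
        # Asymmetric min-max quantization to 4 levels (2 bits) along `dim`.
        # Returns the integer codes plus the scale and zero point needed
        # to dequantize.
        qmax = 3
        xmin = x.amin(dim=dim, keepdim=True)
        xmax = x.amax(dim=dim, keepdim=True)
        scale = (xmax - xmin).clamp(min=1e-8) / qmax
        codes = ((x - xmin) / scale).round().clamp(0, qmax).to(torch.uint8)
        return codes, scale, xmin

    def dequantize(codes, scale, zero):
        return codes.float() * scale + zero

    # Toy single-head KV cache: (num_tokens, head_dim)
    K = torch.randn(128, 64)
    V = torch.randn(128, 64)

    # KIVI's asymmetry: key statistics are taken across tokens (one scale
    # per channel), value statistics across channels (one scale per token).
    k_codes, k_scale, k_zero = quantize_2bit(K, dim=0)  # per-channel keys
    v_codes, v_scale, v_zero = quantize_2bit(V, dim=1)  # per-token values

    K_hat = dequantize(k_codes, k_scale, k_zero)
    print("mean abs key error:", (K_hat - K).abs().mean().item())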

Quick Start & Requirements

  • Install the package in editable mode from the repository root: pip install -e .
  • Install the CUDA implementation: cd quant && pip install -e .
  • Requires Python 3.10+, PyTorch, and Hugging Face Transformers.
  • The CUDA implementation is compiled at install time, so a working CUDA toolchain is needed.
  • Official documentation and examples are available in the repository; a minimal usage sketch follows this list.
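
Once installed, usage follows the example scripts in the repository. The sketch below mirrors the pattern those examples use for a Llama model; the class name LlamaForCausalLM_KIVI and the config fields k_bits, v_bits, group_size, and residual_length come from the repository's examples and may change between versions, so verify them against the current code before copying.

    import torch
    from transformers import LlamaConfig, AutoTokenizer
    from models.llama_kivi import LlamaForCausalLM_KIVI  # shipped in this repo

    model_name = "meta-llama/Llama-2-7b-hf"
    config = LlamaConfig.from_pretrained(model_name)
    config.k_bits = 2            # 2-bit key cache
    config.v_bits = 2            # 2-bit value cache
    config.group_size = 32       # elements per quantization group
    config.residual_length = 32  # recent tokens kept in full precision

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = LlamaForCausalLM_KIVI.from_pretrained(
        model_name,
        config=config,
        torch_dtype=torch.float16,
        device_map="auto",
    )

    inputs = tokenizer("KIVI compresses the KV cache to", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))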

Highlighted Details

  • Achieves 2.6x peak memory reduction for KV caches.
  • Increases LLM inference throughput by 2.35x to 3.47x.
  • Supports Llama, Falcon, and Mistral model families.
  • Recent updates include support for GQA and for Hugging Face Transformers' built-in KV cache quantization (see the sketch after this list).
  • Beta optimizations for reduced latency are available in the develop branch.
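
For the Transformers-side integration noted above, recent Hugging Face Transformers releases also expose a KIVI-style quantized KV cache directly through generate, with no repository-specific patches. A minimal sketch, assuming a recent Transformers version with the quanto backend (optimum-quanto) installed:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )

    inputs = tokenizer("Long-context inference needs", return_tensors="pt").to(model.device)

    # Quantize the KV cache on the fly instead of keeping it in fp16.
    out = model.generate(
        **inputs,
        max_new_tokens=32,
        cache_implementation="quantized",
        cache_config={"backend": "quanto", "nbits": 2},
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))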

Maintenance & Community

  • The accompanying paper was accepted to ICML 2024.
  • Contributions are welcomed via issues or pull requests.
  • No specific community channels (Discord/Slack) are listed.

Licensing & Compatibility

  • Released under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • The CUDA implementation requires separate installation and compilation.
  • Some optimizations are in beta and may require reinstallation.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: inactive
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

  • 9 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Jeremy Howard (cofounder of fast.ai).

QuaRot by spcl

  • Top 0.5% on SourcePulse
  • 424 stars
  • Code for a NeurIPS 2024 research paper on LLM quantization
  • Created 1 year ago; updated 9 months ago