KVQuant by SqueezeAILab

Research paper on KV cache quantization for long context LLM inference

Created 1 year ago
381 stars

Top 74.8% on SourcePulse

Project Summary

KVQuant is a methodology for efficient KV cache quantization, enabling significantly longer context lengths for Large Language Model (LLM) inference. It targets researchers and engineers working with LLMs who face memory bottlenecks when processing extended contexts, offering a path to serving models with context lengths in the millions of tokens on limited hardware.

How It Works

KVQuant addresses KV cache memory limitations by quantizing the cache to low precision. To maintain accuracy, it employs several novel techniques: per-channel, pre-RoPE Key quantization to handle channel-wise outliers; Non-Uniform Quantization (NUQ) to better fit the non-uniform distribution of activations; and Dense-and-Sparse Quantization to isolate outliers that would otherwise stretch the quantization range. These methods exploit patterns the authors observe in KV cache values across different LLMs.
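The sketch below illustrates two of these ideas, per-channel Key scaling and dense-and-sparse outlier handling, in plain NumPy. It is not the repository's implementation: the shapes, bit width, and outlier fraction are illustrative assumptions, and a uniform grid stands in for the paper's non-uniform (NUQ) codebooks.

```python
# Minimal NumPy sketch of per-channel + dense-and-sparse quantization.
# NOT the repository's implementation; shapes, bit width, and the
# outlier fraction are illustrative assumptions.
import numpy as np

def dense_and_sparse_quantize(keys, bits=3, outlier_frac=0.01):
    """keys: (num_tokens, num_channels) pre-RoPE Key activations."""
    # 1. Dense-and-sparse: pull out the largest-magnitude values so they
    #    do not stretch the quantization range for everything else.
    threshold = np.quantile(np.abs(keys), 1.0 - outlier_frac)
    outlier_mask = np.abs(keys) > threshold
    sparse_outliers = np.where(outlier_mask, keys, 0.0)  # kept in full precision
    dense_part = np.where(outlier_mask, 0.0, keys)

    # 2. Per-channel scaling: Keys show consistent per-channel outlier
    #    structure, so compute one scale per channel (axis=0), not per token.
    levels = 2 ** bits - 1
    ch_min = dense_part.min(axis=0, keepdims=True)
    ch_max = dense_part.max(axis=0, keepdims=True)
    scale = (ch_max - ch_min) / levels
    scale = np.where(scale == 0, 1.0, scale)              # avoid divide-by-zero

    quantized = np.round((dense_part - ch_min) / scale).astype(np.uint8)
    dequantized = quantized * scale + ch_min

    # Reconstruction: exact outliers where masked, low-bit values elsewhere.
    return np.where(outlier_mask, sparse_outliers, dequantized)

# Example: 128 cached tokens, 128 Key channels.
keys = np.random.randn(128, 128).astype(np.float32)
approx = dense_and_sparse_quantize(keys)
print("mean abs error:", np.abs(approx - keys).mean())
```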

Quick Start & Requirements

The codebase is structured into five subfolders (gradients, quant, deployment, lwm, benchmarking), each with its own installation instructions. Reproducing paper results requires running the gradients and quant steps. Specific hardware requirements are not detailed, but the project demonstrates LLaMA-7B inference with 1M context on a single A100-80GB GPU and 10M context on an 8-GPU system.
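For context on why quantization matters at these context lengths, here is a rough back-of-envelope sizing of the KV cache. The LLaMA-7B dimensions (32 layers, 4096 hidden size) are standard, but the bit widths compared are illustrative, not the exact configurations benchmarked in the paper.

```python
# Back-of-envelope KV cache sizing for LLaMA-7B at 1M tokens.
# Bit widths below are illustrative assumptions, not the paper's exact settings.
layers, hidden = 32, 4096
tokens = 1_000_000

def kv_cache_gib(bits_per_value):
    # Keys and Values -> factor of 2; bits -> bytes -> GiB.
    return 2 * layers * hidden * tokens * bits_per_value / 8 / 2**30

print(f"fp16  KV cache @ 1M tokens: {kv_cache_gib(16):,.0f} GiB")  # ~488 GiB
print(f"3-bit KV cache @ 1M tokens: {kv_cache_gib(3):,.0f} GiB")   # ~92 GiB
print(f"2-bit KV cache @ 1M tokens: {kv_cache_gib(2):,.0f} GiB")   # ~61 GiB
```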

Highlighted Details

  • Enables LLaMA-7B with 1M context on a single A100-80GB GPU.
  • Supports 10M context length for LLaMA-7B on an 8-GPU system.
  • Incorporates Attention Sink-Aware Quantization, leaving initial tokens in FP16 for performance gains (see the sketch after this list).
  • Includes parallel topK support on GPU and kernels for parallel prompt processing.
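A minimal sketch of the attention-sink idea referenced above: the first few tokens are left untouched while the rest of the cache goes through a stand-in low-bit quantizer. The sink count and the quantizer itself are assumptions for illustration, not the repository's kernels.

```python
# Sketch of attention-sink-aware quantization: sink tokens stay in full
# precision, the rest of the cache is quantized. The quantizer and the
# sink count are illustrative assumptions.
import numpy as np

def fake_low_bit(x, bits=3):
    # Stand-in per-tensor uniform quantizer, used only to make the example runnable.
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

def quantize_with_attention_sinks(cache, num_sink_tokens=4, bits=3):
    """cache: (num_tokens, num_channels) Key or Value activations."""
    sinks = cache[:num_sink_tokens]                      # attention sinks stay in full precision
    rest = fake_low_bit(cache[num_sink_tokens:], bits)   # everything else is quantized
    return np.concatenate([sinks, rest], axis=0)

# Example: 1024 cached tokens, 128 channels; the first 4 tokens are untouched.
cache = np.random.randn(1024, 128).astype(np.float32)
mixed = quantize_with_attention_sinks(cache)
```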

Maintenance & Community

The project reuses components from GPTQ, GPTQ-For-LLaMA, and SqueezeLLM. A roadmap is partially outlined, with completed items including deployment code and optimized kernels. No community links (Discord, Slack) are provided in the README.

Licensing & Compatibility

The project's license is not explicitly stated in the README. However, its reliance on libraries like GPTQ suggests potential compatibility considerations for commercial or closed-source use, depending on the licenses of those dependencies.

Limitations & Caveats

The README does not detail specific hardware requirements beyond GPU examples, nor does it provide explicit installation instructions for a unified setup. The project appears to be research-oriented, and the stability or production-readiness of the deployment code is not specified.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 30 days

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

Explore Similar Projects

GPTQ-triton by fpgaminer

307 stars
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago
Updated 2 years ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Jeremy Howard (Cofounder of fast.ai).

QuaRot by spcl

424 stars
Code for a NeurIPS 2024 research paper on LLM quantization
Created 1 year ago
Updated 9 months ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago
Updated 2 months ago