KVQuant by SqueezeAILab

Research paper on KV cache quantization for long context LLM inference

created 1 year ago
364 stars

Top 78.4% on sourcepulse

View on GitHub
Project Summary

KVQuant is a methodology for efficient KV cache quantization, enabling significantly longer context lengths for Large Language Model (LLM) inference. It targets researchers and engineers working with LLMs who face memory bottlenecks when processing extended contexts, offering a path to serve models with millions of tokens on limited hardware.

How It Works

KVQuant addresses KV cache memory limitations by quantizing the cache to low precision. It employs several novel techniques to maintain accuracy: per-channel, pre-RoPE Key quantization, which quantizes Keys along the channel dimension and before the rotary positional embedding to better capture outlier channels; Non-Uniform Quantization (NUQ) to better represent the non-uniform distributions of cached activations; and Dense-and-Sparse Quantization, which keeps a small fraction of outlier values in full precision. These methods exploit patterns observed in KV cache values across different LLMs.
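
The sketch below illustrates two of these ideas on a single attention head: per-channel Key quantization (one scale per channel, applied before RoPE) and dense-and-sparse quantization (outliers kept in full precision). It is a simplified stand-in rather than the project's actual code: KVQuant's NUQ fits non-uniform codebooks offline, whereas this example uses a uniform grid, and the 3-bit width and 1% outlier fraction are illustrative.

```python
# Minimal NumPy sketch of per-channel Key quantization with dense-and-sparse
# outlier handling. Not the KVQuant API; bit width and outlier fraction are
# illustrative assumptions.
import numpy as np

def quantize_keys_per_channel(keys, n_bits=3, outlier_frac=0.01):
    """keys: (seq_len, head_dim) pre-RoPE Key activations for one head."""
    # Sparse component: keep the largest-magnitude values in full precision.
    thresh = np.quantile(np.abs(keys), 1.0 - outlier_frac)
    outlier_mask = np.abs(keys) > thresh
    dense = np.where(outlier_mask, 0.0, keys)

    # Dense component: per-channel (per-column) uniform quantization.
    lo = dense.min(axis=0, keepdims=True)
    hi = dense.max(axis=0, keepdims=True)
    scale = (hi - lo) / (2**n_bits - 1) + 1e-12
    q = np.clip(np.round((dense - lo) / scale), 0, 2**n_bits - 1)

    # Dequantize and add the sparse outliers back in full precision.
    dequant = q * scale + lo
    dequant[outlier_mask] = keys[outlier_mask]
    return dequant

# Toy usage: 64 cached tokens, head dimension 128, with one outlier channel.
rng = np.random.default_rng(0)
k = rng.normal(size=(64, 128)).astype(np.float32)
k[:, 5] *= 20.0
err = np.abs(quantize_keys_per_channel(k) - k).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```

In the paper, Value activations are handled differently from Keys (quantized per token rather than per channel), since they lack the per-channel outlier structure that motivates the approach above.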

Quick Start & Requirements

The codebase is structured into five subfolders (gradients, quant, deployment, lwm, benchmarking), each with its own installation instructions. Reproducing paper results requires running the gradients and quant steps. Specific hardware requirements are not detailed, but the project demonstrates LLaMA-7B inference with 1M context on a single A100-80GB GPU and 10M context on an 8-GPU system.

Highlighted Details

  • Enables LLaMA-7B with 1M context on a single A100-80GB GPU.
  • Supports 10M context length for LLaMA-7B on an 8-GPU system.
  • Incorporates Attention Sink-Aware Quantization, leaving the initial "attention sink" tokens in FP16 to preserve accuracy (see the sketch after this list).
  • Includes parallel topK support on GPU and kernels for parallel prompt processing.
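
As a rough illustration of the attention-sink idea (not the repository's API), the sketch below leaves the first few cached tokens untouched and quantizes only the rest; the number of retained sink tokens and the simple per-tensor quantizer are placeholder assumptions.

```python
# Hypothetical sketch of attention sink-aware quantization: the first few
# cached tokens stay in full precision, only later tokens are quantized.
import numpy as np

def fake_quant(x, n_bits=3):
    """Placeholder per-tensor uniform quantizer, used only for illustration."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**n_bits - 1) + 1e-12
    return np.round((x - lo) / scale) * scale + lo

def sink_aware_quantize(kv_cache, num_sink_tokens=4):
    """kv_cache: (seq_len, head_dim); num_sink_tokens is an assumed setting."""
    out = kv_cache.copy()
    out[num_sink_tokens:] = fake_quant(kv_cache[num_sink_tokens:])
    return out

cache = np.random.default_rng(1).normal(size=(32, 128)).astype(np.float32)
mixed = sink_aware_quantize(cache)
assert np.allclose(mixed[:4], cache[:4])  # sink tokens are left untouched
```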

Maintenance & Community

The project reuses components from GPTQ, GPTQ-For-LLaMA, and SqueezeLLM. A roadmap is partially outlined, with completed items including deployment code and optimized kernels. No community links (Discord, Slack) are provided in the README.

Licensing & Compatibility

The project's license is not explicitly stated in the README. However, its reliance on libraries like GPTQ suggests potential compatibility considerations for commercial or closed-source use, depending on the licenses of those dependencies.

Limitations & Caveats

The README does not detail specific hardware requirements beyond GPU examples, nor does it provide explicit installation instructions for a unified setup. The project appears to be research-oriented, and the stability or production-readiness of the deployment code is not specified.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

21 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeremy Howard (Cofounder of fast.ai), and 4 more.

llm-awq by mit-han-lab

  • Top 0.4% on sourcepulse, 3k stars
  • Weight quantization research paper for LLM compression/acceleration
  • Created 2 years ago, updated 2 weeks ago