KVQuant by SqueezeAILab

Research paper on KV cache quantization for long context LLM inference

Created 1 year ago
381 stars

Top 74.8% on SourcePulse

Project Summary

KVQuant is a methodology for efficient KV cache quantization, enabling significantly longer context lengths for Large Language Model (LLM) inference. It targets researchers and engineers working with LLMs who face memory bottlenecks when processing extended contexts, offering a path to serving models with context lengths in the millions of tokens on limited hardware.

How It Works

KVQuant addresses KV cache memory limitations by quantizing the cache to low precision. To maintain accuracy, it employs several novel techniques: per-channel, pre-RoPE Key quantization to handle channel-wise outliers; Non-Uniform Quantization (NUQ) to better fit the non-uniform distribution of activations; and Dense-and-Sparse Quantization to isolate outliers that would otherwise stretch the quantization range. These methods exploit patterns the authors observe in KV cache values across different LLMs.
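The sketch below illustrates two of these ideas, per-channel Key scaling and dense-and-sparse outlier handling, in plain NumPy. It is not the repository's implementation: the shapes, bit width, and outlier fraction are illustrative assumptions, and a uniform grid stands in for the paper's non-uniform (NUQ) codebooks.

```python
# Minimal NumPy sketch of per-channel + dense-and-sparse quantization.
# NOT the repository's implementation; shapes, bit width, and the
# outlier fraction are illustrative assumptions.
import numpy as np

def dense_and_sparse_quantize(keys, bits=3, outlier_frac=0.01):
    """keys: (num_tokens, num_channels) pre-RoPE Key activations."""
    # 1. Dense-and-sparse: pull out the largest-magnitude values so they
    #    do not stretch the quantization range for everything else.
    threshold = np.quantile(np.abs(keys), 1.0 - outlier_frac)
    outlier_mask = np.abs(keys) > threshold
    sparse_outliers = np.where(outlier_mask, keys, 0.0)  # kept in full precision
    dense_part = np.where(outlier_mask, 0.0, keys)

    # 2. Per-channel scaling: Keys show consistent per-channel outlier
    #    structure, so compute one scale per channel (axis=0), not per token.
    levels = 2 ** bits - 1
    ch_min = dense_part.min(axis=0, keepdims=True)
    ch_max = dense_part.max(axis=0, keepdims=True)
    scale = (ch_max - ch_min) / levels
    scale = np.where(scale == 0, 1.0, scale)              # avoid divide-by-zero

    quantized = np.round((dense_part - ch_min) / scale).astype(np.uint8)
    dequantized = quantized * scale + ch_min

    # Reconstruction: exact outliers where masked, low-bit values elsewhere.
    return np.where(outlier_mask, sparse_outliers, dequantized)

# Example: 128 cached tokens, 128 Key channels.
keys = np.random.randn(128, 128).astype(np.float32)
approx = dense_and_sparse_quantize(keys)
print("mean abs error:", np.abs(approx - keys).mean())
```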

Quick Start & Requirements

The codebase is structured into five subfolders (gradients, quant, deployment, lwm, benchmarking), each with its own installation instructions. Reproducing paper results requires running the gradients and quant steps. Specific hardware requirements are not detailed, but the project demonstrates LLaMA-7B inference with 1M context on a single A100-80GB GPU and 10M context on an 8-GPU system.
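For context on why quantization matters at these context lengths, here is a rough back-of-envelope sizing of the KV cache. The LLaMA-7B dimensions (32 layers, 4096 hidden size) are standard, but the bit widths compared are illustrative, not the exact configurations benchmarked in the paper.

```python
# Back-of-envelope KV cache sizing for LLaMA-7B at 1M tokens.
# Bit widths below are illustrative assumptions, not the paper's exact settings.
layers, hidden = 32, 4096
tokens = 1_000_000

def kv_cache_gib(bits_per_value):
    # Keys and Values -> factor of 2; bits -> bytes -> GiB.
    return 2 * layers * hidden * tokens * bits_per_value / 8 / 2**30

print(f"fp16  KV cache @ 1M tokens: {kv_cache_gib(16):,.0f} GiB")  # ~488 GiB
print(f"3-bit KV cache @ 1M tokens: {kv_cache_gib(3):,.0f} GiB")   # ~92 GiB
print(f"2-bit KV cache @ 1M tokens: {kv_cache_gib(2):,.0f} GiB")   # ~61 GiB
```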

Highlighted Details

  • Enables LLaMA-7B with 1M context on a single A100-80GB GPU.
  • Supports 10M context length for LLaMA-7B on an 8-GPU system.
  • Incorporates Attention Sink-Aware Quantization, leaving initial tokens in FP16 for performance gains (see the sketch after this list).
  • Includes parallel topK support on GPU and kernels for parallel prompt processing.
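A minimal sketch of the attention-sink idea referenced above: the first few tokens are left untouched while the rest of the cache goes through a stand-in low-bit quantizer. The sink count and the quantizer itself are assumptions for illustration, not the repository's kernels.

```python
# Sketch of attention-sink-aware quantization: sink tokens stay in full
# precision, the rest of the cache is quantized. The quantizer and the
# sink count are illustrative assumptions.
import numpy as np

def fake_low_bit(x, bits=3):
    # Stand-in per-tensor uniform quantizer, used only to make the example runnable.
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

def quantize_with_attention_sinks(cache, num_sink_tokens=4, bits=3):
    """cache: (num_tokens, num_channels) Key or Value activations."""
    sinks = cache[:num_sink_tokens]                      # attention sinks stay in full precision
    rest = fake_low_bit(cache[num_sink_tokens:], bits)   # everything else is quantized
    return np.concatenate([sinks, rest], axis=0)

# Example: 1024 cached tokens, 128 channels; the first 4 tokens are untouched.
cache = np.random.randn(1024, 128).astype(np.float32)
mixed = quantize_with_attention_sinks(cache)
```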

Maintenance & Community

The project reuses components from GPTQ, GPTQ-For-LLaMA, and SqueezeLLM. A roadmap is partially outlined, with completed items including deployment code and optimized kernels. No community links (Discord, Slack) are provided in the README.

Licensing & Compatibility

The project's license is not explicitly stated in the README. However, its reliance on libraries like GPTQ suggests potential compatibility considerations for commercial or closed-source use, depending on the licenses of those dependencies.

Limitations & Caveats

The README does not detail specific hardware requirements beyond GPU examples, nor does it provide explicit installation instructions for a unified setup. The project appears to be research-oriented, and the stability or production-readiness of the deployment code is not specified.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 30 days

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

Explore Similar Projects

GPTQ-triton by fpgaminer

307 stars
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago
Updated 2 years ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI) and Jeremy Howard (Cofounder of fast.ai).

QuaRot by spcl

424 stars
Code for a NeurIPS 2024 research paper on LLM quantization
Created 1 year ago
Updated 9 months ago
Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 7 more.

llm-awq by mit-han-lab

3k stars
Weight quantization research paper for LLM compression/acceleration
Created 2 years ago
Updated 2 months ago