Efficient inference framework for long-context LLMs
Quest is an open-source framework designed to accelerate long-context Large Language Model (LLM) inference by reducing KV cache memory movement. It targets researchers and engineers whose LLM workloads process long sequences, offering significant speedups without compromising accuracy.
How It Works
Quest introduces a query-aware token criticality estimation algorithm, built on the observation that the importance of tokens in the KV cache depends on the current query vector. By tracking the element-wise minimum and maximum Key values within each KV cache page and combining them with the query vector to estimate criticality, Quest selectively loads only the Top-K most critical KV cache pages for attention computation. This approach reduces memory bandwidth bottlenecks, a primary cause of slowdown in long-context inference.
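The page-selection idea can be written down in a few lines. The snippet below is a simplified NumPy sketch of the criticality estimate, not the project's actual kernels or API; the function name, shapes, and the assumption that per-page min/max Key statistics are precomputed are illustrative only.

```python
# Minimal sketch (hypothetical helper, not Quest's real API) of query-aware
# Top-K page selection from per-page Key min/max statistics.
import numpy as np

def select_critical_pages(query, key_min, key_max, top_k):
    """Estimate an upper bound on each page's attention score and keep Top-K pages.

    query:   (head_dim,)            current query vector
    key_min: (num_pages, head_dim)  element-wise minimum of Keys in each page
    key_max: (num_pages, head_dim)  element-wise maximum of Keys in each page
    """
    # Per channel, the largest possible q_i * k_i inside a page is
    # max(q_i * key_min_i, q_i * key_max_i); summing over channels gives an
    # upper bound on the page's pre-softmax attention score.
    upper_bound = np.maximum(query * key_min, query * key_max).sum(axis=-1)
    top_k = min(top_k, upper_bound.shape[0])
    # Indices of the pages with the highest estimated criticality (unordered).
    return np.argpartition(upper_bound, -top_k)[-top_k:]

# Example: 8 pages of a 64-dim head, keep the 3 most critical pages.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k_min = rng.standard_normal((8, 64)) - 1.0
k_max = k_min + np.abs(rng.standard_normal((8, 64)))
print(select_critical_pages(q, k_min, k_max, top_k=3))
```

Only the selected pages would then be gathered for the actual attention computation; all other pages stay in memory untouched, which is where the bandwidth savings come from.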
Quick Start & Requirements
Clone the repository with submodules (git clone --recurse-submodules), create and activate a conda environment (Python 3.10), and install dependencies by running pip install -e . from the repository root. Flash-Attention (v2.6.3) and CMake (>= 3.26.4) are required. Building the custom kernels involves compiling RAFT first, then the main kernels.
Maintenance & Community
The project is associated with mit-han-lab. It leverages and acknowledges contributions from libraries like Huggingface Transformers, FlashInfer, lm_eval, H2O, StreamingLLM, and Punica.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source integration.
Limitations & Caveats
The project is presented as a research artifact with associated evaluation scripts. While it claims negligible accuracy loss, thorough independent validation is recommended. The README mentions a TODO for GQA model support, indicating potential incompleteness.