Quest by mit-han-lab

Inference framework for efficient long-context LLM inference

created 1 year ago
308 stars

Top 88.2% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

Quest is an open-source framework designed to accelerate long-context Large Language Model (LLM) inference by optimizing KV cache memory movement. It targets researchers and engineers working with LLMs that require processing extensive sequences, offering significant speedups without compromising accuracy.

How It Works

Quest introduces a query-aware token criticality estimation algorithm, built on the observation that the importance of tokens in the KV cache depends on the current query vector. By tracking the element-wise minimum and maximum Key values within each KV cache page and combining them with the query vector to upper-bound each page's attention score, Quest selectively loads only the Top-K critical KV cache pages for attention computation. This approach reduces memory bandwidth bottlenecks, a primary cause of slowdown in long-context inference.
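To make the page-selection step concrete, here is a minimal PyTorch sketch of the criticality estimate; this is an illustration under stated assumptions, not Quest's actual kernels or API, and the function and tensor names are hypothetical. For each channel, the product q_j * k_j is maximized by either the page's minimum or maximum Key value depending on the sign of q_j, so the channel-wise maximum of the two products, summed over channels, upper-bounds the page's attention score.

```python
# Hypothetical sketch of Quest-style query-aware page selection.
# Names (select_critical_pages, key_min, key_max) are illustrative only.
import torch

def select_critical_pages(query, key_min, key_max, top_k):
    """Estimate per-page criticality and return indices of the Top-K pages.

    query:   (head_dim,)            current query vector for one head
    key_min: (num_pages, head_dim)  element-wise min of Keys in each page
    key_max: (num_pages, head_dim)  element-wise max of Keys in each page
    """
    # Per channel, q_j * k_j is largest at either the min or max Key value,
    # so taking the maximum of the two products and summing over channels
    # gives an upper bound on the page's attention score.
    upper_bound = torch.maximum(query * key_min, query * key_max).sum(dim=-1)
    # Only the Top-K most critical pages would be loaded for attention.
    return torch.topk(upper_bound, k=top_k).indices

# Example: 16 pages, head_dim 128, keep the 4 most critical pages.
q = torch.randn(128)
k_min = torch.randn(16, 128)
k_max = k_min + torch.rand(16, 128)  # guarantees max >= min per channel
print(select_critical_pages(q, k_min, k_max, top_k=4))
```

In the real system this estimate runs in fused CUDA kernels, and only the selected pages are fetched from memory for the attention computation, which is where the bandwidth savings come from.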

Quick Start & Requirements

  • Installation: Clone the repository with its submodules (git clone --recurse-submodules), create and activate a conda environment with Python 3.10, and install dependencies with pip install -e . (editable install). Flash-Attention v2.6.3 and CMake >= 3.26.4 are required. Building the custom kernels involves compiling RAFT first, then the main kernels.
  • Prerequisites: CUDA 12.4 is recommended for efficiency evaluations.
  • Resources: Building kernels and dependencies may take some time.
  • Documentation: Links to paper, poster, and slides are provided.

Highlighted Details

  • Achieves up to 7.03x self-attention speedup and 2.23x end-to-end inference latency reduction.
  • Demonstrates negligible accuracy loss on tasks with long dependencies.
  • Supports Llama-3.1 and Mistral-v0.3 model families.
  • Includes scripts for accuracy and efficiency evaluation, as well as usage examples.

Maintenance & Community

The project is associated with mit-han-lab. It builds on and acknowledges libraries and projects including Huggingface Transformers, FlashInfer, lm_eval, H2O, StreamingLLM, and Punica.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source integration.

Limitations & Caveats

The project is presented as a research artifact with associated evaluation scripts. While it reports negligible accuracy loss, independent validation on your own workloads is recommended. The README lists a TODO for grouped-query attention (GQA) model support, indicating that model coverage is not yet complete.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
Star History
38 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

applied-ai by pytorch-labs

Applied AI experiments and examples for PyTorch

Top 0.3% · 289 stars · created 2 years ago · updated 2 months ago