Quest by mit-han-lab

Inference framework for efficient long-context LLM inference

created 1 year ago
308 stars

Top 88.2% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

Quest is an open-source framework designed to accelerate long-context Large Language Model (LLM) inference by optimizing KV cache memory movement. It targets researchers and engineers working with LLMs that require processing extensive sequences, offering significant speedups without compromising accuracy.

How It Works

Quest introduces a query-aware token criticality estimation algorithm, built on the observation that the importance of tokens in the KV cache depends on the current query vector. By tracking the element-wise minimum and maximum Key values within each KV cache page and combining them with the query vector to upper-bound each page's attention score, Quest selectively loads only the Top-K critical KV cache pages for attention computation. This approach reduces memory bandwidth bottlenecks, a primary cause of slowdown in long-context inference.
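To make the page-selection step concrete, here is a minimal PyTorch sketch of the criticality estimate; this is an illustration under stated assumptions, not Quest's actual kernels or API, and the function and tensor names are hypothetical. For each channel, the product q_j * k_j is maximized by either the page's minimum or maximum Key value depending on the sign of q_j, so the channel-wise maximum of the two products, summed over channels, upper-bounds the page's attention score.

```python
# Hypothetical sketch of Quest-style query-aware page selection.
# Names (select_critical_pages, key_min, key_max) are illustrative only.
import torch

def select_critical_pages(query, key_min, key_max, top_k):
    """Estimate per-page criticality and return indices of the Top-K pages.

    query:   (head_dim,)            current query vector for one head
    key_min: (num_pages, head_dim)  element-wise min of Keys in each page
    key_max: (num_pages, head_dim)  element-wise max of Keys in each page
    """
    # Per channel, q_j * k_j is largest at either the min or max Key value,
    # so taking the maximum of the two products and summing over channels
    # gives an upper bound on the page's attention score.
    upper_bound = torch.maximum(query * key_min, query * key_max).sum(dim=-1)
    # Only the Top-K most critical pages would be loaded for attention.
    return torch.topk(upper_bound, k=top_k).indices

# Example: 16 pages, head_dim 128, keep the 4 most critical pages.
q = torch.randn(128)
k_min = torch.randn(16, 128)
k_max = k_min + torch.rand(16, 128)  # guarantees max >= min per channel
print(select_critical_pages(q, k_min, k_max, top_k=4))
```

In the real system this estimate runs in fused CUDA kernels, and only the selected pages are fetched from memory for the attention computation, which is where the bandwidth savings come from.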

Quick Start & Requirements

  • Installation: Clone the repository with its submodules (git clone --recurse-submodules), create and activate a conda environment with Python 3.10, and install dependencies with pip install -e . (editable install). Flash-Attention v2.6.3 and CMake >= 3.26.4 are required. Building the custom kernels involves compiling RAFT first, then the main kernels.
  • Prerequisites: CUDA 12.4 is recommended for efficiency evaluations.
  • Resources: Building kernels and dependencies may take some time.
  • Documentation: Links to paper, poster, and slides are provided.

Highlighted Details

  • Achieves up to 7.03x self-attention speedup and 2.23x end-to-end inference latency reduction.
  • Demonstrates negligible accuracy loss on tasks with long dependencies.
  • Supports Llama-3.1 and Mistral-v0.3 model families.
  • Includes scripts for accuracy and efficiency evaluation, as well as usage examples.

Maintenance & Community

The project is associated with mit-han-lab. It builds on and acknowledges libraries and projects including Huggingface Transformers, FlashInfer, lm_eval, H2O, StreamingLLM, and Punica.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source integration.

Limitations & Caveats

The project is presented as a research artifact with associated evaluation scripts. While it reports negligible accuracy loss, independent validation on your own workloads is recommended. The README lists a TODO for grouped-query attention (GQA) model support, indicating that model coverage is not yet complete.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
Star History
38 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

applied-ai by pytorch-labs

Applied AI experiments and examples for PyTorch

Top 0.3% · 289 stars · created 2 years ago · updated 2 months ago