unified-cache-management by ModelEngine-Group

Speed up LLM inference by managing KV cache

Created 7 months ago
251 stars

Top 99.8% on SourcePulse

Project Summary

Unified Cache Manager (UCM) addresses the growing challenge of large and sparse KV caches in Large Language Models (LLMs), particularly for long-sequence inference. It persists and reuses KV cache data through advanced retrieval mechanisms, including prefix caching and training-free sparse attention. The framework targets engineers and researchers working with LLMs, aiming to significantly reduce inference latency and GPU memory consumption and thereby enable more efficient handling of demanding tasks such as multi-turn dialogue and long-context reasoning.

How It Works

UCM's core principle is to persist LLM KV caches and eliminate redundant computations via multiple retrieval strategies. It introduces a unified framework with pluggable sparse algorithms, built upon base classes like UcmSparseBase and KVStoreBase. This design decouples sparse algorithm implementations from external storage systems, allowing seamless integration with various storage solutions like NFS. By identifying KV cache blocks through IDs and offsets, UCM efficiently supports both sparse scenarios and prefix caching, enhancing flexibility and performance.

Quick Start & Requirements

Integration with vLLM is central to UCM's quick start: users are directed to the "Quick Start for vLLM" and "Quick Start for vLLM-Ascend" guides for setup. The project is maintained against vLLM version 0.11.0. Hardware prerequisites (e.g., GPU models, CUDA versions) and software requirements beyond vLLM are not detailed in the README.

Highlighted Details

  • Supports a comprehensive set of features including Prefix Cache, Cache Blend, Model Window Extrapolation, Prefill Offload, Sparse Attention, Sparse Attention Offload, and Heterogeneous PD Disaggregation.
  • When integrated with vLLM, UCM reports a 3-10x reduction in inference latency across diverse scenarios.
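Prefix caching, the first feature listed above, rests on a simple idea: requests that share a leading token sequence (e.g., the same system prompt) can reuse the KV blocks computed for that prefix instead of recomputing them. The sketch below illustrates the concept with a chained hash over block-aligned prefixes; the block size, hashing scheme, and cache structure are illustrative assumptions, not UCM's implementation.

```python
import hashlib

BLOCK = 4  # tokens per KV block (illustrative choice)


def prefix_keys(token_ids: list[int]) -> list[str]:
    """Chained hash of each block-aligned prefix, so two requests that share
    a prefix produce identical keys for the shared leading blocks."""
    keys: list[str] = []
    h = hashlib.sha256()
    full_blocks = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, full_blocks, BLOCK):
        h.update(repr(token_ids[i : i + BLOCK]).encode())
        keys.append(h.copy().hexdigest())
    return keys


# key -> KV block; stand-in for a persisted cache backend
cache: dict[str, object] = {}


def cached_prefix_len(token_ids: list[int]) -> int:
    """Number of leading tokens whose KV blocks are already cached,
    i.e., how much prefill computation can be skipped."""
    hits = 0
    for key in prefix_keys(token_ids):
        if key not in cache:
            break
        hits += 1
    return hits * BLOCK
```

Because the hash is chained, a key matches only when the entire prefix up to that block matches, which prevents falsely reusing blocks whose earlier context differs.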

Maintenance & Community

The project maintains both main and develop branches, both compatible with vLLM v0.11.0. Technical questions and feature requests are managed via GitHub Issues. A WeChat technical discussion group is also available, indicated by a QR code in the documentation.

Licensing & Compatibility

UCM is licensed under the MIT license with additional conditions. Users are advised to consult the LICENSE file for specific details regarding usage and restrictions. No explicit compatibility notes for commercial use or closed-source linking are provided.

Limitations & Caveats

The README does not explicitly document limitations, alpha status, known bugs, or unsupported platforms. The project is presented as a stable integration for vLLM.

Health Check
Last Commit

23 hours ago

Responsiveness

Inactive

Pull Requests (30d)
69
Issues (30d)
9
Star History
7 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 11 more.

GPTCache by zilliztech

0.1%
8k
Semantic cache for LLM queries, integrated with LangChain and LlamaIndex
Created 2 years ago
Updated 7 months ago
Starred by Taranjeet Singh (Cofounder of Mem0), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

0.4%
7k
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago
Updated 17 hours ago