unified-cache-management by ModelEngine-Group

Speed up LLM inference by managing KV cache

Created 7 months ago
251 stars

Top 99.8% on SourcePulse

Project Summary

Unified Cache Manager (UCM) addresses the growing challenge of large and sparse KV caches in Large Language Models (LLMs), particularly for long-sequence inference. It persists and reuses KV cache data through advanced retrieval mechanisms, including prefix caching and training-free sparse attention. The framework targets engineers and researchers working with LLMs, aiming to significantly reduce inference latency and GPU memory consumption and thereby enable more efficient handling of demanding tasks such as multi-turn dialogue and long-context reasoning.

How It Works

UCM's core principle is to persist LLM KV caches and eliminate redundant computations via multiple retrieval strategies. It introduces a unified framework with pluggable sparse algorithms, built upon base classes like UcmSparseBase and KVStoreBase. This design decouples sparse algorithm implementations from external storage systems, allowing seamless integration with various storage solutions like NFS. By identifying KV cache blocks through IDs and offsets, UCM efficiently supports both sparse scenarios and prefix caching, enhancing flexibility and performance.

Quick Start & Requirements

Integration with vLLM is central to UCM's quick start: users are directed to the "Quick Start for vLLM" and "Quick Start for vLLM-Ascend" guides for setup. The project is maintained against vLLM version 0.11.0. Hardware prerequisites (e.g., GPU models, CUDA versions) and software requirements beyond vLLM are not detailed in the README.

Highlighted Details

  • Supports a comprehensive set of features including Prefix Cache, Cache Blend, Model Window Extrapolation, Prefill Offload, Sparse Attention, Sparse Attention Offload, and Heterogeneous PD Disaggregation.
  • When integrated with vLLM, UCM reports a 3-10x reduction in inference latency across diverse scenarios.
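Prefix caching, the first feature listed above, rests on a simple idea: requests that share a leading token sequence (e.g., the same system prompt) can reuse the KV blocks computed for that prefix instead of recomputing them. The sketch below illustrates the concept with a chained hash over block-aligned prefixes; the block size, hashing scheme, and cache structure are illustrative assumptions, not UCM's implementation.

```python
import hashlib

BLOCK = 4  # tokens per KV block (illustrative choice)


def prefix_keys(token_ids: list[int]) -> list[str]:
    """Chained hash of each block-aligned prefix, so two requests that share
    a prefix produce identical keys for the shared leading blocks."""
    keys: list[str] = []
    h = hashlib.sha256()
    full_blocks = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, full_blocks, BLOCK):
        h.update(repr(token_ids[i : i + BLOCK]).encode())
        keys.append(h.copy().hexdigest())
    return keys


# key -> KV block; stand-in for a persisted cache backend
cache: dict[str, object] = {}


def cached_prefix_len(token_ids: list[int]) -> int:
    """Number of leading tokens whose KV blocks are already cached,
    i.e., how much prefill computation can be skipped."""
    hits = 0
    for key in prefix_keys(token_ids):
        if key not in cache:
            break
        hits += 1
    return hits * BLOCK
```

Because the hash is chained, a key matches only when the entire prefix up to that block matches, which prevents falsely reusing blocks whose earlier context differs.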

Maintenance & Community

The project maintains both main and develop branches, both compatible with vLLM v0.11.0. Technical questions and feature requests are managed via GitHub Issues. A WeChat technical discussion group is also available, indicated by a QR code in the documentation.

Licensing & Compatibility

UCM is licensed under the MIT license with additional conditions. Users are advised to consult the LICENSE file for specific details regarding usage and restrictions. No explicit compatibility notes for commercial use or closed-source linking are provided.

Limitations & Caveats

The README does not explicitly document limitations, alpha status, known bugs, or unsupported platforms. The project is presented as a stable integration for vLLM.

Health Check
Last Commit

23 hours ago

Responsiveness

Inactive

Pull Requests (30d)
69
Issues (30d)
9
Star History
7 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (Author of LLaMA-Factory), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 11 more.

GPTCache by zilliztech

0.1%
8k
Semantic cache for LLM queries, integrated with LangChain and LlamaIndex
Created 2 years ago
Updated 7 months ago
Starred by Taranjeet Singh (Cofounder of Mem0), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

0.4%
7k
LLM serving engine extension for reduced TTFT and increased throughput
Created 1 year ago
Updated 17 hours ago