Discover and explore top open-source AI tools and projects—updated daily.
ModelEngine-Group: Speed up LLM inference by managing the KV cache
Top 99.8% on SourcePulse
Unified Cache Manager (UCM) addresses the growing challenge of large and sparse KV caches in Large Language Models (LLMs), particularly for long sequence inference. It offers a solution by persisting and reusing KV cache data through advanced retrieval mechanisms, including prefix caching and training-free sparse attention. This framework targets engineers and researchers working with LLMs, aiming to significantly reduce inference latency and GPU memory consumption, thereby enabling more efficient handling of demanding tasks like multi-turn dialogues and long-context reasoning.
How It Works
UCM's core principle is to persist LLM KV caches and eliminate redundant computations via multiple retrieval strategies. It introduces a unified framework with pluggable sparse algorithms, built upon base classes like UcmSparseBase and KVStoreBase. This design decouples sparse algorithm implementations from external storage systems, allowing seamless integration with various storage solutions like NFS. By identifying KV cache blocks through IDs and offsets, UCM efficiently supports both sparse scenarios and prefix caching, enhancing flexibility and performance.
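The pluggable design described above can be sketched in a few lines. `UcmSparseBase` and `KVStoreBase` are the base-class names the project actually uses, but their real signatures are not given here, so the interface below, its method names and parameters, and the toy in-memory backend are hypothetical illustrations of the stated idea: decoupling storage backends behind a common interface and addressing KV cache blocks by ID and offset.

```python
from abc import ABC, abstractmethod
from typing import Optional

# Hypothetical sketch of a KVStoreBase-style interface. The real UCM class
# shares this name, but its actual methods are assumptions for illustration.
class KVStoreBase(ABC):
    @abstractmethod
    def put(self, block_id: str, offset: int, data: bytes) -> None: ...

    @abstractmethod
    def get(self, block_id: str, offset: int) -> Optional[bytes]: ...

# A toy in-memory backend standing in for an external store such as NFS.
# Blocks are keyed by (block_id, offset), mirroring how UCM is described
# as identifying KV cache blocks.
class DictStore(KVStoreBase):
    def __init__(self) -> None:
        self._blocks: dict[tuple[str, int], bytes] = {}

    def put(self, block_id: str, offset: int, data: bytes) -> None:
        self._blocks[(block_id, offset)] = data

    def get(self, block_id: str, offset: int) -> Optional[bytes]:
        return self._blocks.get((block_id, offset))

# Prefix caching in miniature: a previously seen prompt prefix hits the
# store and skips recomputation; an unseen prompt misses and must be
# computed, then persisted for future reuse.
store = DictStore()
store.put("prompt-prefix", 0, b"kv-tensor-bytes")
cached = store.get("prompt-prefix", 0)  # hit: reuse persisted KV data
miss = store.get("new-prompt", 0)       # miss: compute, then persist
```

Because both prefix caching and sparse-attention retrieval go through the same block-addressed interface, any backend implementing it can serve either scenario, which is the flexibility the design aims for.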
Quick Start & Requirements
Integration with vLLM is central to UCM's quick start. Users are directed to the "Quick Start for vLLM" and "Quick Start for vLLM-Ascend" guides for setup. The project is maintained against vLLM v0.11.0. Hardware and software prerequisites beyond vLLM (e.g., GPU models, CUDA versions) are not detailed in the README.
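As a rough orientation before consulting those guides: vLLM exposes a generic `--kv-transfer-config` flag for pluggable KV connectors, and a UCM deployment would presumably hook in there. The connector name and config keys below are placeholders, not taken from UCM's documentation; the real values are in the "Quick Start for vLLM" guide.

```shell
# Sketch only. "UcmConnector" and the model name are hypothetical
# placeholders; see UCM's own quick-start guides for the real
# connector name and any additional configuration.
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --kv-transfer-config '{"kv_connector": "UcmConnector", "kv_role": "kv_both"}'
```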
Highlighted Details
Maintenance & Community
The project maintains both main and develop branches, both compatible with vLLM v0.11.0. Technical questions and feature requests are managed via GitHub Issues. A WeChat technical discussion group is also available, indicated by a QR code in the documentation.
Licensing & Compatibility
UCM is licensed under the MIT license with additional conditions. Users are advised to consult the LICENSE file for specific details regarding usage and restrictions. No explicit compatibility notes for commercial use or closed-source linking are provided.
Limitations & Caveats
The README does not explicitly list limitations, alpha status, known bugs, or unsupported platforms; the project is presented as a stable integration for vLLM.