FlexKV by taco-project

LLM inference acceleration via distributed KV cache management

Created 1 year ago

299 stars

Top 88.8% on SourcePulse

Summary

FlexKV manages the KV cache for high-performance distributed Large Language Model (LLM) inference. Developed by Tencent Cloud's TACO team, it provides a distributed KV store and a multi-level cache system that aim to raise inference throughput and reduce latency by offloading KV cache data to tiered storage. It targets large-scale LLM deployments and integrates with vLLM, NVIDIA Dynamo, and TensorRT-LLM.

How It Works

FlexKV uses a three-tier caching hierarchy: CPU memory, local SSD, and scalable distributed storage. Offloading KVCache data to these tiers mitigates GPU VRAM limits. The architecture comprises three core modules:
- StorageEngine: handles block-level KVCache storage, with optional block-wise merging to optimize I/O.
- GlobalCacheEngine: acts as the control plane, using a RadixTree for prefix matching and managing cache eviction.
- TransferEngine: executes data transfers asynchronously through io_uring and GPU Direct Storage (GDS), so that computation and data movement can overlap.
Distributed KVCache reuse across nodes relies on a distributed RadixTree and the RDMA-based Mooncake Transfer Engine.
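To make the prefix-matching idea concrete, here is a minimal sketch of block-level prefix lookup. It is not FlexKV's RadixTree: it is a plain (uncompressed) trie keyed by fixed-size token blocks, and the names `PrefixIndex`, `BLOCK_SIZE`, and `block_id` are hypothetical, chosen only for illustration. The point is that a new request can reuse cached KV blocks for the longest prefix of full blocks it shares with earlier requests.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple

BLOCK_SIZE = 4  # tokens per KV block; illustrative value, not a FlexKV default


@dataclass
class Node:
    # Children are keyed by the token tuple of one full block.
    children: Dict[Tuple[int, ...], "Node"] = field(default_factory=dict)
    block_id: Optional[int] = None  # hypothetical handle to the stored KV data


class PrefixIndex:
    """Trie over token blocks (an assumption; a real radix tree compresses paths)."""

    def __init__(self) -> None:
        self.root = Node()
        self._next_id = 0

    @staticmethod
    def _blocks(tokens: Sequence[int]) -> List[Tuple[int, ...]]:
        # Only full blocks are indexed; a trailing partial block is ignored.
        full = len(tokens) - len(tokens) % BLOCK_SIZE
        return [tuple(tokens[i:i + BLOCK_SIZE]) for i in range(0, full, BLOCK_SIZE)]

    def insert(self, tokens: Sequence[int]) -> List[int]:
        """Register every full block of `tokens`; return the block ids on the path."""
        node, ids = self.root, []
        for blk in self._blocks(tokens):
            child = node.children.get(blk)
            if child is None:
                child = Node(block_id=self._next_id)
                self._next_id += 1
                node.children[blk] = child
            ids.append(child.block_id)
            node = child
        return ids

    def match_prefix(self, tokens: Sequence[int]) -> List[int]:
        """Return block ids for the longest cached prefix of full blocks."""
        node, ids = self.root, []
        for blk in self._blocks(tokens):
            child = node.children.get(blk)
            if child is None:
                break  # first miss ends the reusable prefix
            ids.append(child.block_id)
            node = child
        return ids
```

In this sketch, a request whose first block matches a cached sequence but whose second block differs would reuse one block and recompute the rest. Matching at block granularity keeps the index small and lines up with block-level storage, at the cost of never reusing a partial block.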

Quick Start & Requirements

Dependencies: Requires liburing-dev, libxxhash-dev, and libhiredis-dev.
Build: Compile using ./build.sh or ./build.sh --release.
Integration:
- vLLM: FlexKV is now a built-in component (FlexKVConnectorV1) in vLLM mainline (v0.17.2+). See docs/vllm_adapter/README_en.md.
- NVIDIA Dynamo: Integrated as a native KV Cache Offloading option. See docs/dynamo_integration/README_en.md.
- TensorRT-LLM: Support is available. See docs/trtllm_adaption/README_en.md.
API: As of v1.0.0, FlexKV exposes a directly callable library API.

Highlighted Details

Framework Integration: Integrated into vLLM, NVIDIA Dynamo, and TensorRT-LLM, with support for TP16 configurations.
Performance Enhancements: Features GPU Direct Storage (GDS) for direct SSD-to-GPU transfers, optimizing I/O. Supports RDMA-based Mooncake Transfer Engine for high-performance cross-node KVCache sharing.
Distributed Capabilities: Enables distributed KVCache reuse across nodes with a distributed RadixTree and lease mechanism.
Monitoring: Includes a zero-intrusion Prometheus-based runtime monitoring framework for cache hit/miss rates, memory status, and transfer statistics.
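The source does not describe how the lease mechanism under Distributed Capabilities works. One common reading is that a lease protects a shared block from eviction for a bounded time while another node reads it. The sketch below illustrates that reading only; `LeaseTable`, `grant`, `can_evict`, and the TTL value are hypothetical and do not reflect FlexKV's API.

```python
import time
from typing import Callable, Dict


class LeaseTable:
    """Time-bounded eviction protection per block (illustrative assumption)."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_s = ttl_s
        self.clock = clock  # injectable so the behaviour can be tested deterministically
        self._expires_at: Dict[int, float] = {}

    def grant(self, block_id: int) -> float:
        """Grant or renew a lease; return the new expiry time."""
        expiry = self.clock() + self.ttl_s
        # Renewal never shortens an existing lease.
        expiry = max(expiry, self._expires_at.get(block_id, 0.0))
        self._expires_at[block_id] = expiry
        return expiry

    def can_evict(self, block_id: int) -> bool:
        """A block is evictable if it has no lease or its lease has expired."""
        expiry = self._expires_at.get(block_id)
        return expiry is None or expiry <= self.clock()
```

Under this assumption, the owner of a block would call `can_evict` before reclaiming it, and a remote reader would call `grant` before starting a transfer and again if the transfer runs long. Expiry-based leases avoid the need for an explicit release message if the reader fails.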

Maintenance & Community

Developed by Tencent Cloud's TACO team in collaboration with the community. Its integration into vLLM and NVIDIA Dynamo indicates adoption by major inference frameworks. The main branch is used for rapid iteration, while release-* branches provide stable versions.

Licensing & Compatibility

Licensed under the permissive Apache-2.0 License, which allows use in commercial and closed-source applications.

Limitations & Caveats

The README does not list explicit limitations. The project is under active development; roadmap items include further distributed query support and latency optimization. Building requires the system-level dependencies listed under Quick Start & Requirements.
