taco-project
LLM inference acceleration via distributed KV cache management
Summary
FlexKV manages the KV cache for distributed Large Language Model (LLM) inference. Developed by Tencent Cloud's TACO team, it provides a distributed KV store and a multi-level cache designed to raise inference throughput and reduce latency by using tiered storage. It targets large-scale LLM deployments and integrates with inference frameworks such as vLLM and NVIDIA Dynamo.
How It Works
FlexKV uses a three-tier caching hierarchy: CPU memory, local SSD, and scalable distributed storage. Offloading KV cache data to these tiers eases the limit imposed by GPU VRAM. The architecture has three core modules. StorageEngine handles block-level KV cache storage, with optional block-wise merging to improve I/O efficiency. GlobalCacheEngine is the control plane: it uses a RadixTree for prefix matching and manages cache eviction. TransferEngine performs data transfers asynchronously through io_uring and GPU Direct Storage (GDS), so that computation and data movement can overlap. Across nodes, KV cache reuse relies on a distributed RadixTree and the RDMA-based Mooncake Transfer Engine.
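The prefix-matching idea can be illustrated with a minimal sketch. It is not FlexKV's API: the names `PrefixIndex`, `BLOCK_SIZE`, and `TIERS` are hypothetical, and a plain trie keyed by token blocks stands in for a real radix tree (which would also compress single-child paths). Each node records the fastest tier known to hold that block, so a lookup returns how many leading blocks of a request can be reused and where they live.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple

BLOCK_SIZE = 4                      # tokens per KV block (illustrative value)
TIERS = ("cpu", "ssd", "remote")    # ordered fastest to slowest (assumed labels)

Block = Tuple[int, ...]


def split_blocks(tokens: Sequence[int]) -> List[Block]:
    """Split a token sequence into full blocks; a trailing partial block is dropped."""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    return [tuple(tokens[i:i + BLOCK_SIZE]) for i in range(0, full, BLOCK_SIZE)]


@dataclass
class Node:
    tier: Optional[str] = None                      # fastest tier holding this block
    children: Dict[Block, "Node"] = field(default_factory=dict)


class PrefixIndex:
    """Block-level trie: a simplified stand-in for a radix tree."""

    def __init__(self) -> None:
        self.root = Node()

    def insert(self, tokens: Sequence[int], tier: str) -> None:
        """Record that every full block of `tokens` is cached in `tier`."""
        node = self.root
        for block in split_blocks(tokens):
            node = node.children.setdefault(block, Node())
            # Keep the fastest tier if the block is already known elsewhere.
            if node.tier is None or TIERS.index(tier) < TIERS.index(node.tier):
                node.tier = tier

    def match(self, tokens: Sequence[int]) -> List[str]:
        """Return the tier of each leading block that is already cached."""
        node = self.root
        hits: List[str] = []
        for block in split_blocks(tokens):
            child = node.children.get(block)
            if child is None:
                break                                # prefix ends at first miss
            hits.append(child.tier)
            node = child
        return hits


index = PrefixIndex()
index.insert(list(range(8)), "cpu")     # two blocks cached in CPU memory
index.insert(list(range(12)), "ssd")    # a longer prompt; third block on SSD
hits = index.match(list(range(12)) + [99, 100])
print(hits)
```

In this sketch, a request that shares its first three blocks with earlier prompts would need fresh computation only for the unmatched tail; the matched blocks would be fetched from the tiers listed in `hits`.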
Quick Start & Requirements
Dependencies: liburing-dev, libxxhash-dev, and libhiredis-dev.
Build: ./build.sh or ./build.sh --release.
vLLM: connector (FlexKVConnectorV1) in vLLM mainline (v0.17.2+). See docs/vllm_adapter/README_en.md.
NVIDIA Dynamo: see docs/dynamo_integration/README_en.md.
TensorRT-LLM: see docs/trtllm_adaption/README_en.md.
Highlighted Details
Maintenance & Community
Developed by Tencent Cloud's TACO team in collaboration with the community. FlexKV's integration into frameworks such as vLLM and NVIDIA Dynamo indicates adoption beyond the original team. The main branch is used for rapid iteration, while release-* branches provide stable versions.
Licensing & Compatibility
Licensed under Apache-2.0, a permissive license that allows use in commercial and closed-source applications.
Limitations & Caveats
The README does not list explicit limitations. The project is under active development; roadmap items include further distributed query support and latency optimization, which suggests these areas are still maturing. Building requires system-level dependencies.